NOTES ON NOTATION AND ON TABLES FOR
FACILITATING STATISTICAL WORK 


A. Notation 

The reader is assumed to be familiar with the commoner mathematical 
signs, e.g. those for addition and multiplication. We shall also employ 
the following symbols, all of which are in general use — 

The factorial sign 

The symbol $n!$, read "factorial n," means the number

$$1 \times 2 \times 3 \times \cdots \times (n-2) \times (n-1) \times n$$

Factorial n is by some writers expressed by the symbol $\underline{|n}$, but this
notation appears to be falling out of use in favour of $n!$, probably owing
to the greater ease with which the latter form can be printed and
typewritten.

The combinatorial sign 

The symbol ${}^{n}C_{r}$ means the number of ways in which r things can be
chosen from n things, e.g., ${}^{52}C_{13}$ is the number of ways in which a hand
of cards can be dealt from an ordinary pack of 52 cards.

In most textbooks on algebra it is shown that

$${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$

A more modern symbol is

$$\binom{n}{r}$$

and we shall use this form occasionally.
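
As a rough check on the two symbols just described, the following sketch computes a factorial by direct multiplication and the number of combinations from the standard formula n!/(r!(n-r)!). The function names `factorial` and `choose` are our own, and the hand is assumed to contain 13 cards; it is an illustration only, not part of the original notes.

```python
def factorial(n: int) -> int:
    """Return n! = 1 x 2 x 3 x ... x n (with 0! = 1 by convention)."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def choose(n: int, r: int) -> int:
    """Number of ways of choosing r things from n things: n! / (r! (n - r)!)."""
    if r < 0 or r > n:
        return 0  # no way of choosing more things than there are
    return factorial(n) // (factorial(r) * factorial(n - r))


# Assumption: a "hand" is 13 cards dealt from the ordinary pack of 52.
print(factorial(5))    # 120
print(choose(52, 13))  # 635013559600
```

Integer arithmetic is used throughout, so the result is exact however large n becomes.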

The summation sign 

The sum of n numbers $x_1, x_2, \ldots, x_n$ is written $\sum_{r=1}^{n} (x_r)$, read
"sum $x_r$ from one to n," i.e.

$$\sum_{r=1}^{n} (x_r) = x_1 + x_2 + \cdots + x_{n-1} + x_n$$

Where no ambiguity is likely to arise, the suffix r and the limits
written above and below $\Sigma$ are omitted, e.g. the above sum would be
written simply $\Sigma(x)$, it being understood from the context that the
summation extends over the n values.

Many writers use the Roman letter S instead of $\Sigma$.

The Greek alphabet

As the letters of the Greek alphabet will often be used as symbols, we
give for convenience the names of those letters.

Small   Capital              Small   Capital
letter  letter   Name        letter  letter   Name

α       Α        Alpha       ν       Ν        Nu
β       Β        Beta        ξ       Ξ        Xi
γ       Γ        Gamma       ο       Ο        Omicron
δ       Δ        Delta       π       Π        Pi
ε       Ε        Epsilon     ρ       Ρ        Rho
ζ       Ζ        Zeta        σ       Σ        Sigma
η       Η        Eta         τ       Τ        Tau
θ       Θ        Theta       υ       Υ        Upsilon
ι       Ι        Iota        φ       Φ        Phi
κ       Κ        Kappa       χ       Χ        Chi
λ       Λ        Lambda      ψ       Ψ        Psi
μ       Μ        Mu          ω       Ω        Omega

B. Calculating Tables

For heavy arithmetical work a calculating machine is invaluable;
but owing to their cost machines are, as a rule, beyond the reach of the
student.

For a great deal of simple work, especially work not intended for 
publication, the student will find a slide rule exceedingly useful : par- 
ticulars and prices will be found in any instrument-maker’s catalogue. 
For greater exactness in multiplying or dividing, logarithms are almost 
essential. 

The student will derive invaluable aid from Barlow's Tables of Squares,
Cubes, Square-roots, Cube-roots, and Reciprocals of all Integral Numbers
up to 10,000 (E. & F. N. Spon, London and New York), which are useful 
over a wide range of statistical work. 

C. Special Tables of Functions useful in Statistical Work 

The tables at the end of this book will cover most of the student's 
ordinary requirements. The more advanced student will find it useful 
to have Tables for Statisticians and Biometricians (Cambridge University
Press), particularly Part I. Research workers will wish to have Fisher
and Yates' Statistical Tables for Biological, Agricultural and Medical
Research (Oliver and Boyd).

D. References to the Text 

Each section in the book is distinguished by a number in heavy type 
consisting of the number of the chapter in which the section occurs 
prefixed to the number of the section in that chapter and separated from 
it by a period ; e.g., 7.13 means the thirteenth section of Chapter 7, and 
10.1 refers to the first section of Chapter 10. The Introduction, which
precedes Chapter 1, is for this purpose regarded as Chapter 0, e.g., 0.26
refers to the twenty-sixth section of the Introduction. References to 
sections are given simply by the number of the sections, e.g., "We saw
in 8.3" means "We saw in the third section of Chapter 8."

Similarly, equations, tables, examples, exercises, diagrams and references 
are distinguished first of all with the number of the chapter in which they 
occur and then, separated by a period, with their serial number within 
the chapter, e.g., "Table 6.7" refers to the seventh table in Chapter 6,
and "Equation (17.8)" refers to the eighth equation of Chapter 17.
These figures are in ordinary type. 

This simple notation saves a good deal of unnecessary wording. To 
facilitate quickness of reference we sometimes give pages as well. 

A distinction is drawn between examples, which are given in the text 
for purposes of illustration, and exercises, which are set at the end of the 
chapter for the student to work out for himself.

Number and measurement

0.1 Western civilisation is pervaded by ideas of number and measure- 
ment. Even the events of our everyday life are inextricably bound up 
with them. We have only to picture a race which cannot count or measure 
trying to run the Bank of England or control the milk market, or even 
understand the sporting columns of the daily press, to realise how deeply 
rooted numbers are in the complex activities of the modern world.

0.2 Science itself is particularly indebted to numerical expression. 
As organised knowledge has increased, the necessity for precision has 
become greater, and in the formulation of precise statements number and 
measurement have played a leading part. The desire for quantitative 
expression was first felt in the physical sciences, but it has now spread into 
nearly all branches of knowledge. The movement is by no means
complete, however, and may be seen at work to-day. As a significant instance
we may note that courageous attempts are being made to subject the 
process of thought itself — that last stronghold of the contentious and the 
mysterious — to quantitative inquiry. 

0.3 Many people, in fact, have been led by their enthusiasm for 
numerical data to regard knowledge of a non-quantitative kind as hardly 
deserving the name "knowledge" at all. Towards the close of the
nineteenth century it was possible for Lord Kelvin to say: "When you can
measure what you are speaking about and express it in numbers you know
something about it; but when you cannot measure it, when you cannot
express it in numbers, your knowledge is of a meagre and unsatisfactory
kind." This remark has often been quoted with an approval which it does
not altogether deserve; it does not, for example, do justice to the work of
Darwin and Pasteur, to name only two of Kelvin's contemporaries. But
there can be no denying that it expresses a point of view which many
people will endorse. 

Numerical data 

0.4 The desire for precision, in fact, leads investigators of all kinds, 
from the atomic physicist to the business man, to express the facts about 
that part of the universe which interests them in a quantitative way.
Numerical data have come into being not only in the laboratory and the
study, but in the counting-house, the sales department, the Board Room
and the legislative assembly. It is difficult to see how our society could be
organised without them. Where the Jews and the Romans were content
with occasional censuses for military or fiscal purposes,¹ the progressive
modern state finds itself under the necessity of keeping a close and quanti- 
tative eye on all that goes on within or without its frontier. A country 
which does not do so may be fairly regarded as backward. In a typical 
phrase, Anatole France summed up this point of view when he said of the 
Chinese: "Tant qu'ils ne se seront pas comptés, ils ne compteront pas," that is,
if they don't count they won't count.

Statistics concerned with numerical data 

0.5 There are certain features of numerical data, no matter in what 
branch of knowledge they originate, which may call for a special type of 
scientific method to treat them and elucidate them. This is known as
"Statistical method," or more briefly, as "Statistics." It does not,
however, embrace the study of numerical data of every kind, and before 
we attempt a formal definition of its nature and scope, it is necessary to 
give some words of explanation. 

Effects and causes 

0.6 One of the principal aims of Science is to trace, amidst the tangled
complex of the external world, the operation of what are called "laws":
to interpret a multiplicity of natural phenomena in terms of a few
fundamental principles. A knowledge of the operation of these laws enables us
to talk of "cause" and "effect." The metaphysical problems associated
with these words need not detain us, but since in the sequel we shall often
use them, it is proper to explain that we adopt them as a convenient way
of expressing serviceable and familiar ideas. We shall be dealing with
the everyday world, where "law" and "cause" have significant and
important connotations.

0.7 With this convention, we may say that any physical event, and 
in particular that described by quantitative data, is produced by the 
operation of one or more causes. The number of causes which produce any
particular effect may be, and usually is, extremely large. For instance,
the height of a man is causally linked with his race, his ancestry, his
habitation, his diet during youth, his age, his occupation, and at any given
moment even with his position and the time of day.

0.8 Experiment, the great weapon of scientific inquiry, derives its power
from the ability of the experimenter to replace such complex systems of
causation by simple systems in which only one causal circumstance is
allowed to vary at a time. This is perhaps an ideal, but it is one which
is closely approached with the technique of modern laboratory practice.

¹ David (II Samuel, 24) numbered the people of Israel and called down a plague by
doing so. He counted 800,000 valiant men who drew the sword, and though the text
is not entirely clear it seems likely that Divine disapproval was directed against the
militaristic purpose of the census, not the census itself. We are told later that 70,000
died of the resulting pestilence, so it looks as if there was no ban on counting dead

0.9 Let us, however, turn for a moment to social science, as the parent
of the methods termed "statistical," and consider its characteristics as
compared, say, with physics or chemistry. One characteristic stands out
so markedly that attention has been repeatedly directed to it by
"statistical" writers as the source of the peculiar difficulties of their
science: the observer of social facts cannot experiment, but must deal with
circumstances as they occur, apart from his control. The simplification open
to the experimenter being impossible, the observer has, in general, to deal
with highly complicated cases of multiple causation, that is, cases in which a
given result may be due to any one of a number of alternative causes or
to a number of different causes acting conjointly.

0.10 A little consideration will show that this is also characteristic of 
observations in other fields. The meteorologist, for example, is in almost 
precisely the same position as the student of social science. He can 
experiment on minor points, but the records of the barometer, thermo- 
meter and rain gauge have to be treated as they stand. With the biologist, 
matters are somewhat better. He can and does apply experimental 
methods to a very large extent, but frequently cannot approximate closely 
to the experimental ideal ; the internal circumstances of animals and plants 
too easily evade complete control. Hence a large field (notably the study 
of variation and heredity) is left in which methods of experiment have to
be supplemented by other methods. The physicist and chemist, finally,
stand at the other extremity of the scale. Theirs are the sciences in which
experiment has been brought to its greatest perfection. But even so, there 
is still scope for the application of statistical treatment in these sciences. 
The methods available for eliminating the effect of disturbing circumstances, 
though continually improved, are not, and cannot be, absolutely perfect. 
The observer himself, as well as the observing instrument, is a source of 
error ; the effects of changes of temperature, or of moisture, or pressure, 
and draughts, vibration, etc., cannot be completely eliminated. 

0.11 It is with data affected by numerous causes that Statistics is mainly 
concerned. Experiment seeks to disentangle a complex of causes by 
removing all but one of them, or rather by concentrating on the study 
of one and reducing the others, as far as circumstances permit, to a com- 
paratively small residuum. Statistics, denied this resource, must accept 
for analysis data subject to the influence of a host of causes, and must 
try to discover from the data themselves which causes are the important 
ones and how much of the observed effect is due to the operation of each. 

Definitions

0.12 In the light of the foregoing discussion we may accordingly give
the following definitions:

By Statistics we mean quantitative data affected to a marked extent
by a multiplicity of causes. 

By Statistical Methods we mean methods specially adapted to the 
elucidation of quantitative data affected by a multiplicity of causes. 

By Theory of Statistics or, more briefly, Statistics we mean the 
exposition of statistical methods.

(It will be observed that the same word may be used both for the 
science and for the raw material on which it works. This dual use 
gives rise to no confusion in practice, but the distinction is worth bearing 
in mind.) 

Use of "statistic"

0.13 This is perhaps the appropriate place to remark that there has
recently come into use the singular form "statistic." This is the name
given to a particular kind of estimate compiled from observations, usually 
according to some algebraical formula. In this book we shall not meet 
the term until we reach the theory of sampling (Chapter 18) and shall 
there use it in a restricted sense. 

History of the word "statistics"

0.14 In their present meaning the words "statistics," "statistician"
and "statistical" are barely a century old. They have, however, been
in use longer than that, and it is instructive to consider the process by
which they have reached their present meaning.

0.15 The words "statist," "statistics," "statistical," appear to be
all derived, more or less indirectly, from the Latin status, in the sense,
acquired in mediaeval Latin, of a political State.

0.16 The first term is, however, of much earlier date than the two others.
The word "statist" is found, for instance, in Hamlet (1602),¹ Cymbeline
(1610 or 1611),² and in Paradise Regained (1671).³ The earliest occurrence
of the word "statistics" yet noted is in The Elements of Universal
Erudition, by Baron J. F. von Bielfeld, translated by W. Hooper, M.D.
(3 vols., London, 1770). One of its chapters is entitled Statistics, and
contains a definition of the subject as "The science that teaches us what is
the political arrangement of all the modern states of the known world."⁴
"Statistics" occurs again with a rather wider definition in the preface to
A Political Survey of the Present State of Europe by E. A. W. Zimmermann,⁵
issued in 1787. "It is about forty years ago," says Zimmermann, "that
that branch of political knowledge, which has for its object the actual and
relative power of the several modern states, the power arising from their
natural advantages, the industry and civilisation of their inhabitants, and
the wisdom of their governments, has been formed, chiefly by German
writers, into a separate science. ... By the more convenient form it has
now received ... this science, distinguished by the new-coined name of
statistics, is become a favourite study in Germany" (p. ii); and the
adjective is also given (p. v): "To the several articles contained in this
work, some respectable statistical writers have added a view of the
principal epochas of the history of each country."

¹ Act 5, sc. 2.   ² Act 2, sc. 4.   ³ Bk. 4.

⁴ We cite from Dr W. F. Willcox, Quarterly Publications of the American Statistical
Association, vol. 14, 1914, p. 287.

⁵ Zimmermann's work appears to have been written in English, though he was a
German and Professor of Natural Philosophy at Brunswick.

0.17 Within the next few years the words were adopted by several
writers, notably by Sir John Sinclair, the editor and organiser of the first
Statistical Account of Scotland,¹ to whom, indeed, their introduction has
been frequently ascribed. In the circular letter to the Clergy of the Church
of Scotland, issued in May 1790,² he states that in Germany "'Statistical
Inquiries,' as they are called, have been carried to a very great extent,"
and adds an explanatory footnote to the phrase "Statistical Inquiries":
"or inquiries respecting the population, the political circumstances, the
productions of a country, and other matters of state." In the "History of the
Origin and Progress"³ of the work, he tells us, "Many people were at first
surprised at my using the new words, Statistics and Statistical, as it was
supposed that some term in our own language might have expressed the
same meaning. But in the course of a very extensive tour, through the
northern parts of Europe, which I happened to take in 1786, I found that in
Germany they were engaged in a species of political inquiry, to which they
had given the name of Statistics;⁴ ... as I thought that a new word might
attract more public attention, I resolved on adopting it, and I hope that it
is now completely naturalised and incorporated with our language." This
hope was certainly justified, but the meaning of the word underwent rapid
development during the half-century or so following its introduction.

¹ Twenty-one vols., 1791-99.

² Statistical Account, vol. 20, Appendix to "The History of the Origin and Progress
..." given at the end of the volume.

³ Loc. cit., p. xiii.

⁴ The Abriss der Staatswissenschaft der Europäischen Reiche (1749) of Gottfried
Achenwall, Professor of Politics at Göttingen, is the volume in which the word
"statistik" appears to be first employed, but the adjective "statisticus" occurs at a
somewhat earlier date in works written in Latin.

0.18 "Statistics" (statistik), as the term was used by German writers
of the eighteenth century, by Zimmermann and by Sir John Sinclair,
meant simply the exposition of the noteworthy characteristics of a state,
the mode of exposition being (almost inevitably at that time)
preponderantly verbal. The conciseness and definite character of numerical
data were recognised at a comparatively early period, more particularly
by English writers, but trustworthy figures were scarce. After the
commencement of the nineteenth century, however, the growth of official
data was continuous, and numerical statements, accordingly, began more
and more to displace the verbal descriptions of earlier days. "Statistics"
thus insensibly acquired a narrower signification, viz. the exposition of
the characteristics of a State by numerical methods. It is difficult to
say at what epoch the word came definitely to bear this quantitative
meaning, but the transition appears to have been only half accomplished
even after the foundation of the Royal Statistical Society in 1834. The
articles in the first volume of the Journal, issued in 1838-39, are for the
most part of a numerical character, but the official definition has no
reference to method. "Statistics," we read, "may be said, in the words
of the prospectus of this Society, to be the ascertaining and bringing
together of those facts which are calculated to illustrate the condition
and prospects of society." It is, however, admitted that "the statist
commonly prefers to employ figures and tabular exhibitions."

0.19 Once the first change of meaning was accomplished, further
changes followed. From the name of a science, the word was transferred
to those series of figures on which it operated, so that one spoke of vital
statistics, shipping statistics, and so on. It was then applied to the
similar numerical data which occurred in other sciences, such as
anthropology and meteorology. By the end of the nineteenth century we find
"statistics of mental characteristics in man," "statistics of children
under the headings bright-average-dull," and even an examination of
the characteristics of the Virgilian hexameter "with statistics." The
development of the meaning of the adjective "statistical" and the noun
"statistician" was naturally similar.

0.20 Perhaps the most abstract use of the word occurs in the theory
of thermodynamics, wherein one speaks of entropy as proportional to the
logarithm of the statistical probability of the universe, a definition which
no statesman would be unwilling to admit to lie completely outside his
purview. But it is unnecessary to multiply instances to show that the
word "statistics" is now entirely divorced from "matters of State."

The theory of statistics 

0.21 The theory of statistics as a distinct branch of scientific method
is of comparatively recent growth. Its roots may be traced in the work
of Laplace and Gauss on the theory of errors of observation, but the
study itself did not begin to flourish until the last quarter of the nineteenth
century. Under the influence of Galton and Karl Pearson remarkable
progress was made, and the foundations of the subject were laid in the
next thirty years (as it has turned out, very securely). The subject has
not, however, yet reached a stage whereat a cut-and-dried exposition of
its methods can be given. Research, particularly into the mathematical 
theory of statistics, is rapidly proceeding, and fresh discoveries are being 
made with a rapidity which makes it difficult to keep pace with them. 
It may, however, help the student to appreciate the work of later chapters 
if we sketch in brief general terms the field of statistical theory as it now 
exists. 

The collection of data 

0.22 The first question which the statistician has to consider is the 
collection and assembling of his data. In many fields, such as economics 
and sociology, he cannot prepare the data himself but has to get what 
he can from such sources as official statistics, which are usually prepared 
with an object differing from his own. Such information is therefore 
rarely all that one could wish. Investigator A, studying the sugar
market, finds that the official figures run cane and beet sugar together. 
Investigator B, wanting to compare prices over a period of years, finds 
that during the war period 1939-1945 there is a gap in the information. 
Investigator C, wishing to study poverty, has to content himself with 
indirect figures such as those of wage levels and unemployment. But 
however incomplete the data may be, and however tangentially pertinent 
to his inquiry, the investigator must take what he can get and be thankful. 

0.23 In other cases, and particularly in meteorology, biology and 
psychology, he can produce his own data or borrow those of other investi- 
gators similarly engaged. He does not merely take his figures from some 
source or other ; he is instrumental in their production, and within limits 
can control their nature so as to bring them to bear directly on his inquiry. 

It might be thought that the only qualities required for such work are 
an ability to count or measure and a reasonable care. But this is not so.
Once outside the laboratory the investigator is beset with a swarm of 
practical difficulties. We might illustrate the point by referring to the 
troubles of an investigator who wished to find out how many dairy cows 
there were in a certain parish. He took the simplest course and went to 
all the farms in the parish and asked the occupier how many cows he had. 
Farmer A said that he had fifteen, but had sold eight and was waiting 
for the buyer to come and fetch them. Farmer B had "about twenty."
Farmer C obviously could not be bothered and said the first figure which
came into his head; and so on. It is clear that the result of such an
inquiry would be to give a quite illusory figure. One of the duties of the
practising statistician is to design his inquiries so as to minimise this kind
of error.

0.24 A full discussion of such matters lies outside the scope of this
book, but we have given them more than a passing mention in order to
introduce one very necessary caution.

The reliability of data must always be examined before any attempt
is made to base conclusions on them. This is true of all data, but 
particularly so of numerical data, which do not carry their quality written 
large upon them. It is a waste of time to apply the refined theoretical 
methods of statistics to data which are suspect from the beginning. 

The treatment of data 

0.25 Having obtained his data and satisfied himself that they are 
reliable enough to permit him to proceed, the statistician must then "lick
them into shape." He must decide on some form of arrangement and
presentation, reduce them to a convenient scale of units, and so on; in
short, he must work on his raw material until it is ready for the application
of his prepared tools.

0.26 The only process of treatment to which attention need be called 
is that of condensation. The mind is incapable of grasping the significance 
of a large mass of figures. If, therefore, the quantity of data available 
is of any size, some process of condensation is necessary to enable the 
mind to appreciate the picture which the data represent. 

Suppose, for instance, we are discussing the stature of a thousand men, 
and have as data the height of each man to the nearest inch. Our raw 
material then consists of a thousand sets of figures ranging from four feet 
to seven feet, or thereabouts. Only the supermind could look over these 
figures and grasp their essentials. Nor would the position be met by 
rearranging the figures in order of magnitude. To get a clear picture of 
the situation some condensation is necessary, and in this case it can be 
carried out easily by grouping together all the men whose heights lie in a 
certain range, say of three inches. Our total range of three feet is then 
replaced by twelve sub-ranges, each of three inches, and we may 
summarise the data by giving the numbers of men who fall into the twelve 
sub-ranges. In short, we have replaced our original thousand figures by 
twelve. 

0.27 It will be clear that in so doing we have sacrificed a certain
amount of information. Twelve figures cannot possibly tell us as much as 
a thousand. It may very well be, however, that the information in the 
twelve is all that we require ; the lost information may be irrelevant to 
the inquiry. Such a case would happen if we wanted to know, to an inch 
or so, what was the height exhibited by the greatest number of men. 
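
The condensation described in 0.26, and the use made of it in 0.27, may be sketched in a few lines. The sketch below assumes heights recorded in whole inches, a range running from 48 inches up to (but not including) 84 inches, and twelve sub-ranges of three inches; the ten heights in `sample_heights` are invented purely for illustration, and the function names are our own.

```python
def group_heights(heights, low=48, width=3, groups=12):
    """Count how many heights (in inches) fall into each sub-range.

    Sub-range i covers low + i*width <= height < low + (i + 1)*width.
    """
    counts = [0] * groups
    for h in heights:
        if not low <= h < low + width * groups:
            raise ValueError(f"height {h} lies outside the assumed range")
        counts[int((h - low) // width)] += 1
    return counts


def modal_group(counts, low=48, width=3):
    """Return the (lower, upper) bounds of the most numerous sub-range."""
    i = max(range(len(counts)), key=counts.__getitem__)
    return low + i * width, low + (i + 1) * width


# Hypothetical data: ten heights to the nearest inch.
sample_heights = [61, 64, 66, 66, 67, 68, 68, 69, 70, 73]

counts = group_heights(sample_heights)
print(counts)               # [0, 0, 0, 0, 1, 1, 5, 2, 1, 0, 0, 0]
print(modal_group(counts))  # (66, 69)
```

The twelve counts no longer tell us each man's height, but they are enough to show, to within the width of a sub-range, which height is the most common.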

0.28 The process of condensation thus sacrifices information but gives 
us instead a very necessary clarity and adaptability for manipulation. 
How far the process is carried in any particular case will depend on how far 
the disadvantages of the sacrifice are offset by the advantages of the 
clarity.

Summarising and descriptive statistics

0.29 The process of summarising which we have just described may 
be carried a great deal further, and leads to a branch of theory which has 
very important practical applications.

The reader is probably familiar already with the idea of an "average
value," and with its use in compressing into a single number the results of
a series of observations. Such quantities are, in fact, the result of sum- 
marising to the greatest possible extent ; they are summaries in which the 
statistician has distilled the information of a diffuse mass of figures into a 
single drop, so to speak. 

0.30 There is a wide demand for such summarising numbers, and a 
good deal of this book will be devoted to considering them from one aspect 
or another. They give a convenient bird’s-eye view of what is sometimes 
a complex and confusing whole. Special sciences have evolved special 
quantities of this type to meet their own needs. For instance, the
economist has invented various kinds of index numbers to express in a
shorthand way complicated changes in prices; and the psychologist has devised
coefficients to express the reactions of an individual mind to a sequence of
tests. 

0.31 The remarks we made in 0.27 and 0.28 apply here with additional 
force. It must never be forgotten that in summarising we omit. Part of 
the statistician’s task is to see that we do not omit too much. 

0.32 The problem of describing a complicated set of data in as few
terms as possible is facilitated by the use of mathematical functions.
Suppose, for instance, that in the thousand men of 0.26 we assumed that
the number of men (y) of height x inches varied as the square of x
(frankly a most improbable result, but one which will serve for the purposes
of illustration). Then we may describe the data completely by an equation
of the form

$$y = ax^2$$

where a is a constant to be determined from the data. Knowing a we can 
find the number of men of any given height. 
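
The text does not say how the constant a is to be determined. One common choice, assumed here and not taken from the text, is the method of least squares: a is chosen to make the sum of the squares of the differences between the observed y and ax² as small as possible, which gives a = Σ(x²y) / Σ(x⁴). The function name `fit_a` and the figures below are hypothetical.

```python
def fit_a(xs, ys):
    """Least-squares estimate of a in y = a * x**2.

    Minimising sum((y - a*x**2)**2) over a gives
    a = sum(x**2 * y) / sum(x**4).
    """
    numerator = sum(x ** 2 * y for x, y in zip(xs, ys))
    denominator = sum(x ** 4 for x in xs)
    return numerator / denominator


# Hypothetical data: heights in inches and numbers of men at each height.
heights = [60, 63, 66, 69, 72]
numbers = [70, 81, 86, 97, 102]

a = fit_a(heights, numbers)
print(round(a, 4))

# Knowing a, the assumed law gives the number of men of any height.
print(round(a * 65 ** 2, 1))
```

If the data obeyed the assumed law exactly, this calculation would recover a exactly; with real data it gives only the best fit under the assumption made.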

0.33 In this case it rather looks as if we have condensed all the
information into a single number a without losing any of it. But that is
not so. What we have done is to replace the set of a thousand figures by
an assumption about their nature. We have lost none of the information
because we assumed, in using the equation, that the information was of
a type known to us already.

0.34 It is found in practice that many sets of data may be very
conveniently expressed by mathematical functions. The question as to which
functions are the most suitable for purposes of description leads to some
interesting theory, some of which will be dealt with later and some of which
is of an advanced character lying outside the scope of an Introduction to
the Theory of Statistics. Such functions are particularly helpful in the
theory of sampling. 

Analysis of data 

0.35 When the statistician has arranged and compressed his data into 
a suitable form, or decided on the functions and evaluated the quantities 
which he has chosen to describe them, the first stage of his inquiry is
finished. It may be that he would wish to take it no further; for instance,
if he is preparing an index number for the economist he may wish to hand 
over the number to that person without comment, for him to make such 
use of it as he thinks fit. More frequently, however, he has prepared the 
data for his own use as a statistician. He then proceeds to the next 
stage, that of analysis and elucidation of the causal system which gave rise 
to them. 

0.36 The methods for such purposes are very numerous. In this 
brief review we need only point out the importance of the investigation of
relationship, the theory of which bulks very large in statistical literature.
If two events are related there is usually, though not always, some causal
nexus between them. The problems of the investigation of relationship 
between phenomena lead to the theory of dependence, contingency and 
correlation, and the formulation of various coefficients to measure the 
extent to which one set of events depends upon another. 

Sampling 

0.37 When we wish to discuss the properties of an aggregate we may 
be prevented by practical or theoretical reasons from examining every 
single member of it. For example, in considering the stature of the male 
inhabitants of the United Kingdom we cannot measure every man, 
because of the time and trouble involved ; and in considering the scores 
of a roulette wheel we cannot examine every score, because the number 
is practically infinite and observations can be continued as long as the 
wheel lasts.

0.38 We do not despair, nevertheless, of being able to gain some 
knowledge of the aggregate. Where we cannot take the whole we do the 
best we can and try to obtain a selection of members. This selection is 
called a sample. 

0.39 It is clear that a sample will not tell us everything about the 
parent aggregate from which it is derived. Nevertheless, most people have 
a feeling (and we shall see later in this book that under certain conditions
the feeling is a justifiable one) that the sample will give us some information
about the parent. Values calculated from the sample may be taken to be
estimates of values in the parent, to a degree of approximation which 
becomes closer as the sample gets larger; and even where the sample is 
small we can sometimes draw inferences of a general nature about the 
parent. 

0.40 We are rarely, if ever, able to reason from the sample to the parent 
with the categorical certainty of a mathematical proof. Our inferences 
will usually be expressed in terms of probabilities. Moreover, we shall find 
it much easier to reject a hypothesis than to accept it. Our inferences 
will generally be not of the type "the hypothesis H is true," or even
"the hypothesis H is probably true," but of the type "hypotheses A,
B and C are probably untrue, but we see no reason to doubt hypothesis D."

For example, suppose we take a sample of a thousand men from the 
population of the United Kingdom and find their average height to be 
5 ft 8 in. What can we say about the average height of the population as 
a whole ? We cannot give it with any certainty. We cannot even say,
with certainty, that it lies within, say, one inch of 5 ft 8 in. What we can
say, assuming that the sampling technique is sound, will be something to
the effect that a hypothesis which supposes that the mean of the whole 
population is greater than 5 ft 9 in. or less than 5 ft 7 in. is probably 
incorrect, but that the data are consistent with the supposition that the 
mean lies between those limits. 

0.41 The theory of sampling is thus closely bound up with the theory 
of probability. The many problems which arise in this connection are 
among the most interesting and at times the most difficult which science 
and philosophy can offer. It is only fair to warn the student that there 
still exists an important difference of opinion among scientific men about 
the validity of certain types of statistical inference. In this book we have, 
so far as we could, avoided these contentious matters, but the advanced
student will have to be prepared to face them sooner or later. 

The popular attitude towards statistics 

0.42 Finally, to conclude this introduction we may, perhaps, refer to 
the popular mistrust of statistics and statistical methods. 

The layman's attitude towards statistics is admirably summed up in 
the remark that mankind is divided into two parts, those who say that 
figures can prove anything and those who assert that they can prove
nothing. It must be admitted that this attitude is not unreasonable.
From the advertisement hoarding, from the electioneering platform, from
the partisan press, and from a dozen other sources, the man in the street is
bombarded with tendentious figures put forward to support some ex parte
statement. Sometimes such figures are justifiably used to form a basis for
the arguments which are built upon them ; more often they give a specious

crudely classified, as suggested above, and we may be able, by auxiliary
hypotheses as to the nature of this variable, to draw further conclusions. 
But the methods and principles developed for the case in which the observer 
only notes the presence or absence of attributes are the simplest and most 
fundamental, and are best considered first. This and the next two 
chapters are accordingly devoted to the Theory of Attributes. 

Classification with reference to attributes 

1.3 The objects or individuals that possess the attribute, and those 
that do not possess it, may be said to be members of two distinct "classes,"
the observer "classifying" the population observed. In the simplest
case, where attention is paid to one attribute alone, only two complementary
classes are formed. If several attributes are noted, the process
of classification may, however, be continued indefinitely. Those that do 
and do not possess the first attribute may be reclassified according as they 
do or do not possess the second, the members of each of the sub-classes 
so formed according as they do or do not possess the third, and so on, 
every class being divided into two at each step. Thus the members 
of the population of any district may be classified into males and females ; 
the members of each sex into sane and insane ; the insane males, sane 
males, insane females and sane females into blind and seeing. If we 
were dealing with a number of peas (Pisum sativum) of different varieties, 
they might be classified as tall or dwarf, with green seeds or yellow seeds,
with wrinkled seeds or round seeds, so that we should have eight classes:
tall with round green seeds, tall with round yellow seeds, tall with wrinkled
green seeds, tall with wrinkled yellow seeds, and four similar classes of
dwarf plants. 
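
The repeated division into two can be sketched in a few lines of Python. This is only an illustration of the pea example: the list name `attributes` and the label strings are our own choices, not part of the text.

```python
from itertools import product

# Each attribute is recorded only as one of two alternatives (a dichotomy).
# The labels follow the pea example; the variable names are illustrative.
attributes = [
    ("tall", "dwarf"),
    ("round seeds", "wrinkled seeds"),
    ("green seeds", "yellow seeds"),
]

# Every final class takes one alternative from each dichotomy in turn.
classes = [", ".join(choice) for choice in product(*attributes)]

for label in classes:
    print(label)

print(len(classes))  # 2 * 2 * 2 = 8
```

Adding a fourth pair to `attributes` would double the number of classes again, which is the sense in which every class is divided into two at each step.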

1.4 It may be noticed that the fact of classification does not necessarily 
imply the existence of either a natural or a clearly defined boundary 
between the two classes. The boundary may be wholly arbitrary, e.g.,
where prices are classified as above or below some special value, barometer
readings as above or below some particular height. The division may also
be vague and uncertain : sanity and insanity, sight and blindness, pass into 
each other by such fine gradations that judgments may differ as to the 
class in which a given individual should be entered. The possibility of 
uncertainties of this kind should always be borne in mind in considering 
statistics of attributes : whatever the nature of classification, however, 
natural or artificial, definite or uncertain, the final judgment must be 
decisive ; any one object or individual must be held either to possess the 
given attribute or not. 

Dichotomy 

1.5 A classification of the simple kind considered, in which each class 
is divided into two sub-classes and no more, has been termed by logicians 
classification, or, to use the more strictly applicable term, division by
dichotomy (cutting in two). The classifications of most statistics are not
dichotomous, for most usually a class is divided into more than two
sub-classes, but dichotomy is the fundamental case. In Chapter 3 the relation
of dichotomy to more elaborate (manifold, instead of twofold or dichotomous)
processes of classification, and the methods applicable to some such
cases, are dealt with briefly. 

1.6 For theoretical purposes it is necessary to have some simple notation 
for the classes formed, and for the numbers of observations assigned 
to each. 

The capitals A, B, C, . . . will be used to denote the several attributes.
An object or individual possessing the attribute A will be termed simply
A. The class, all the members of which possess the attribute A, will
be termed the class A. It is convenient to use single symbols also to
denote the absence of the attributes A, B, C, . . . We shall employ the
Greek letters α, β, γ, . . . Thus if A represents the attribute blindness,
α represents sight, i.e., non-blindness ; if B stands for deafness, β stands
for hearing. Generally "α" is equivalent to "not-A," or an object or
individual not possessing the attribute A ; the class α is equivalent to the
class none of the members of which possesses the attribute A.

1.7 Combinations of attributes will be represented by juxtapositions
of letters. Thus if, as above, A represents blindness, B deafness, AB
represents the combination blindness and deafness. If the presence and
absence of these attributes be noted, the four classes so formed, viz. AB,
Aβ, αB, αβ, include respectively the blind and deaf, the blind but not deaf,
the deaf but not blind, and the neither blind nor deaf. If a third attribute
be noted, e.g. insanity, denoted say by C, the class ABC includes those
who are at once deaf, blind and insane, ABγ those who are deaf and blind
but not insane, and so on.

Any letter or combination of letters like A, AB, αB, ABγ, by means
of which we specify the characters of the members of a class, may be 
termed a class symbol. 

Class-frequencies

1.8 The number of observations assigned to any class is termed, for
brevity, the frequency of the class, or the "class-frequency."
Class-frequencies will be denoted by enclosing the corresponding class-symbols
in brackets. Thus, (A) denotes the number of A's, i.e., objects possessing
attribute A ; (αβC) denotes the number of αβC's, i.e. objects possessing
attribute C but neither A nor B ; and so on for any number of attributes.

Order of classes and class-frequencies

1.9 The classes obtained by noting, say, n attributes fall into natural
groups according to the numbers of attributes used to specify the respective
classes, and these natural groups should be borne in mind in tabulating
the class-frequencies. A class specified by r attributes may be spoken of
as a class of the rth order and its frequency as a frequency of the rth
order. Thus AB, AC, BC are classes of the second order ; (A), (Aβ),
(αBC), (ABγD), class-frequencies of the first, second, third and fourth
orders respectively.

1.10 Class frequencies should, in tabulating, be arranged so that 
frequencies of the same order and frequencies belonging to the same 
aggregate are kept together. Thus the frequencies for the case of three 
attributes should be grouped as given below, the whole number of observa- 
tions denoted by the letter N being reckoned as a frequency of order zero, 
since no attributes are specified. 


Order 0     N

Order 1     (A)      (B)      (C)
            (α)      (β)      (γ)

Order 2     (AB)     (AC)     (BC)
            (Aβ)     (Aγ)     (Bγ)
            (αB)     (αC)     (βC)
            (αβ)     (αγ)     (βγ)

Order 3     (ABC)    (αBC)
            (ABγ)    (αBγ)
            (AβC)    (αβC)
            (Aβγ)    (αβγ)
                                        (1.1)


The total number of class-frequencies

1.11 In such a complete table for the case of three attributes, twenty-
seven distinct frequencies are given : 1 of order zero, 6 of the first order,
12 of the second and 8 of the third.

In general, for n attributes, there are $3^n$ distinct class-frequencies, if we
count N as a frequency of order 0. To demonstrate this, let us consider
the number of classes of different orders.

Of order 0 there is one class N. 

Of order 1 there are 2n classes, for classes of this order contain only one 
symbol, and each of the n attributes contributes two symbols, one of the 
type A and one of the type a. 


Of order 2 there are $\frac{n(n-1)}{1 \cdot 2} \times 2^2$ classes, for each class contains two
symbols, two attributes can be chosen from n in $\frac{n(n-1)}{1 \cdot 2}$ ways, and each
pair gives rise to $2^2$ different frequencies of the types (AB), (Aβ), (αB)
and (αβ).

Similarly, it may be seen that of order r there are

$$\frac{n(n-1) \cdots (n-r+1)}{r!} \times 2^r$$

classes.
Hence, the total number of class-frequencies is

$$1 + 2n + \frac{n(n-1)}{1 \cdot 2}\, 2^2 + \cdots + \frac{n(n-1) \cdots (n-r+1)}{r!}\, 2^r + \cdots + 2^n$$

and this is the binomial expansion of $(1+2)^n = 3^n$.

It is clear that if n is at all large the number of class-frequencies will be 
very great. For instance if n = 6, the number is 729.
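
The count can be checked by brute force. The sketch below assumes our own encoding, in which each attribute is marked "+" (present), "-" (absent) or `None` (not specified); the function names are illustrative only.

```python
from itertools import product
from math import comb

def class_symbols(n):
    """List every class symbol for n attributes.

    Each attribute is marked '+' (present), '-' (absent) or None (unspecified).
    """
    return list(product(("+", "-", None), repeat=n))

def count_by_order(n):
    """Count the classes of each order 0..n by direct enumeration."""
    counts = [0] * (n + 1)
    for symbol in class_symbols(n):
        order = sum(mark is not None for mark in symbol)
        counts[order] += 1
    return counts

for n in (3, 6):
    counts = count_by_order(n)
    # The enumeration agrees with the formula C(n, r) * 2**r for each order r.
    assert counts == [comb(n, r) * 2**r for r in range(n + 1)]
    print(n, counts, sum(counts))
```

For n = 3 the counts by order are 1, 6, 12 and 8, giving 27 in all; for n = 6 the total is 729.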


1.12 Fortunately, however, the class-frequencies are not independent 
of one another, and it is not necessary, in order to specify the data com- 
pletely, to give every class-frequency. 

In the first place, let us note the simple result that any class-frequency 
can always be expressed in terms of class-frequencies of higher order. For
the whole number of observations must clearly be equal to the number of
A's added to the number of α's, i.e.

$$N = (A) + (\alpha) \tag{1.2}$$

Similarly, the number of A's is equal to the number of A's which are
B's added to the number of A's which are β's, i.e.

$$(A) = (AB) + (A\beta) \tag{1.3}$$

Similarly,

$$(AB) = (ABC) + (AB\gamma) \tag{1.4}$$

and so on.


Ultimate class-frequencies

1.13 It follows at once from the result we have just given that every
class-frequency can be expressed in terms of the frequencies of the highest
order, i.e., of order n. For any frequency can be analysed into higher
frequencies, and the process need stop only when we have reached the
frequencies of the highest order. For example, with three attributes,

$$\begin{aligned}(A) &= (AB) + (A\beta) \\ &= (ABC) + (AB\gamma) + (A\beta C) + (A\beta\gamma)\end{aligned}$$

The frequencies of the classes specified by n attributes, i.e. those of the
highest order, are termed the ultimate class-frequencies.

Our result may then be expressed in the form : Every class-frequency
can be expressed as the sum of certain of the ultimate class-frequencies. To
specify the data completely it is, therefore, only necessary to give the
ultimate class-frequencies.

Example 1.1. (See F. Warner and others, "Report on the Scientific
Study of the Mental and Physical Conditions of Childhood," Parkes
Museum, 1895.) A number of school-children were examined for the
presence or absence of certain defects of which three chief descriptions
were noted : A, development defects ; B, nerve signs ; C, low nutrition.
Given the following ultimate frequencies, find the frequencies of the
classes defined by the presence of the defects, i.e. those involving the
Roman letters A, B, C but not the Greek letters α, β, γ, including the
whole number of observations N :


(ABC)       57      (αBC)       78
(ABγ)      281      (αBγ)      670
(AβC)       86      (αβC)       65
(Aβγ)      453      (αβγ)    8,310

The whole number of observations N is equal to the grand total
10,000.
The frequency of any first-order class, e.g. (A), is given by the total of
the four third-order frequencies the class-symbols for which contain the
same letter :

$$(ABC) + (AB\gamma) + (A\beta C) + (A\beta\gamma) = (A) = 877$$

Similarly, the frequency of any second-order class, e.g. (AB), is given
by the total of the two third-order frequencies the class-symbols for which
both contain the same pair of letters :

$$(ABC) + (AB\gamma) = (AB) = 338$$

The complete results are — 


N       10,000      (AB)       338
(A)        877      (AC)       143
(B)      1,086      (BC)       135
(C)        286      (ABC)       57
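
The additions of Example 1.1 can be reproduced mechanically. In the sketch below the ultimate frequencies are keyed by a triple of booleans saying whether A, B and C are present; this encoding and the names `ultimate`, `positive_frequency` and `positive` are our own assumptions for illustration.

```python
from itertools import combinations

NAMES = "ABC"

# Ultimate class-frequencies of Example 1.1, keyed by (A present, B present, C present).
ultimate = {
    (True, True, True): 57,       # (ABC)
    (False, True, True): 78,      # (αBC)
    (True, True, False): 281,     # (ABγ)
    (False, True, False): 670,    # (αBγ)
    (True, False, True): 86,      # (AβC)
    (False, False, True): 65,     # (αβC)
    (True, False, False): 453,    # (Aβγ)
    (False, False, False): 8310,  # (αβγ)
}

def positive_frequency(present):
    """Sum the ultimate frequencies of every class possessing all attributes in `present`."""
    wanted = [NAMES.index(letter) for letter in present]
    return sum(freq for key, freq in ultimate.items() if all(key[i] for i in wanted))

# Positive class-frequencies of every order; the empty combination gives N.
positive = {}
for order in range(len(NAMES) + 1):
    for combo in combinations(NAMES, order):
        positive["".join(combo) or "N"] = positive_frequency(combo)

print(positive)
```

The printed dictionary holds N = 10000, (A) = 877, (B) = 1086, (C) = 286, (AB) = 338, (AC) = 143, (BC) = 135 and (ABC) = 57.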


The number of ultimate class-frequencies

1.14 The class-frequencies of highest order each contain n symbols.
Now each letter corresponding to a particular attribute may be written
in two ways, A or α, B or β, etc. Hence the total number of possible
symbols is

$$2 \times 2 \times 2 \times \cdots \times 2 = 2^n$$

and this is the number of ultimate class-frequencies.

Hence the $3^n$ frequencies may all be expressed in terms of the $2^n$
ultimate frequencies. For example, if n = 6, the 729 frequencies can be
written in terms of 64 ultimate class-frequencies, which specify the data
completely.


The ultimate frequencies are, however, not the only set which specify 
the whole of the data. In fact any set will serve the purpose provided
that (a) they are $2^n$ in number, and (b) they are algebraically independent ;
that is to say, when they are written symbolically no one can be expressed
in terms of some or all of the others.

We may call such a set of frequencies a fundamental set. 
Positive attributes

1.15 The attributes denoted by capitals ABC . . . may be termed 
positive attributes, and their contraries, denoted by Greek letters, negative 
attributes. If a class-symbol includes only capital letters, the class may 
be termed a positive class ; if only Greek letters, a negative class. Thus 
the classes A, AB, ABC are positive classes ; the classes α, αβ, αβγ,
negative classes. 

If we make a certain dichotomy with regard to a definite attribute A
(such as male sex, blindness or blue eyes) it may be of practical importance
to note a possible distinction in the nature of the class not-A. The
complementary class may, in fact, either be equally definite — female sex, 
ability to see — or it may be a mere heterogeneous remainder, as in our 
last instance — not-blue-eyed, the not-blue-eyed being brown-eyed, grey- 
eyed, or even possessing no eyes at all. 

Logically, this distinction is difficult to maintain, but practically it is 
of some importance. The statistical data in official returns are almost 
always classified according to positive and clearly defined attributes. 
For example, we are given the numbers of persons dying from typhoid, 
not the numbers who did not die of typhoid ; the number of acres under 
grass, not the number of acres not under grass. 

1.16 The positive class-frequencies form a fundamental set in the sense
of 1.14 ; that is to say, they specify the data completely. They are
algebraically independent ; no one positive class-frequency can be
expressed wholly in terms of the others. Their number is, moreover, $2^n$,
as may be readily seen from the fact that if the Greek letters are struck
out of the symbols for the ultimate classes, they become the symbols for
the positive classes, with the exception of αβγ . . . for which N must be
substituted.

Example 1.2. — Given the positive class-frequencies of Example 1.1, to 
find all the class-frequencies. 

The data are :

N = 10,000 ; (A) = 877 ; (B) = 1086 ; (C) = 286 ; (AB) = 338 ;
(AC) = 143 ; (BC) = 135 ; (ABC) = 57.

We have

$$(AB) = (AB\gamma) + (ABC)$$

or

$$338 = (AB\gamma) + 57$$

i.e.

$$(AB\gamma) = 281$$

Similarly, from (AC) and (BC) we find

$$(A\beta C) = 86, \qquad (\alpha BC) = 78$$

This gives us the three ultimate class-frequencies which contain only
one Greek letter. For the others,

$$\begin{aligned}(\alpha\beta C) &= (C) - (BC) - (A\beta C) \\ &= 286 - 135 - 86 \\ &= 65\end{aligned}$$

Similarly, we have

$$(A\beta\gamma) = 453, \qquad (\alpha B\gamma) = 670$$

Finally,

$$\begin{aligned}(\alpha\beta\gamma) &= (\beta\gamma) - (A\beta\gamma) \\ &= (\gamma) - (B\gamma) - (A\beta\gamma) \\ &= N - (C) - \{(B) - (BC)\} - (A\beta\gamma) \\ &= 10{,}000 - 286 - 951 - 453 \\ &= 8310\end{aligned}$$

We can now calculate any class-frequency by expressing it in terms of
the ultimate class-frequencies, e.g.

$$\begin{aligned}(\alpha\gamma) &= (\alpha B\gamma) + (\alpha\beta\gamma) \\ &= 670 + 8310 \\ &= 8980\end{aligned}$$
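
The reverse passage, from positive to ultimate frequencies, can also be sketched in code. The function below subtracts and adds positive frequencies over the absent attributes, which comes to the same thing as the repeated subtractions of Example 1.2; the dictionary layout (the empty string standing for N) and the function name are our own assumptions.

```python
from itertools import combinations

NAMES = "ABC"

# Positive class-frequencies of Example 1.2; the empty key stands for N.
positive = {"": 10000, "A": 877, "B": 1086, "C": 286,
            "AB": 338, "AC": 143, "BC": 135, "ABC": 57}

def ultimate_frequency(present):
    """Frequency of the class possessing exactly the attributes in `present`.

    Every absent attribute contributes an alternating sum: the positive
    frequency without it, minus the positive frequency with it.
    """
    absent = [letter for letter in NAMES if letter not in present]
    total = 0
    for r in range(len(absent) + 1):
        for extra in combinations(absent, r):
            key = "".join(sorted(set(present) | set(extra)))
            total += (-1) ** r * positive[key]
    return total

print(ultimate_frequency("AB"))   # (ABγ)
print(ultimate_frequency("C"))    # (αβC)
print(ultimate_frequency(""))     # (αβγ)

# Any other class-frequency is then a sum of ultimate frequencies,
# e.g. (αγ) = (αBγ) + (αβγ).
alpha_gamma = ultimate_frequency("B") + ultimate_frequency("")
print(alpha_gamma)
```

The three printed ultimate frequencies are 281, 65 and 8310, and the last line gives (αγ) = 8980, as in the worked example.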

1.17 The data encountered in practice are rarely dichotomised according 
to more than three or four variables, and the student should experience 
little difficulty in expressing any class-frequency in terms of the known
class-frequencies, either directly, or by first finding the ultimate
class-frequencies and then expressing the desired frequency in terms of them.

It is, however, interesting to note the general result that the class
symbols can be treated as operators and multiplied together like algebraical
quantities. Let us write A . N for the operation of dichotomising N
according to A, and write

$$A \cdot N = (A)$$

which is the symbolic way of saying that if we dichotomise N according to
A we get a class-frequency equal to (A). We can similarly put

$$\alpha \cdot N = (\alpha)$$

Adding these two, and putting A . N + α . N equal to (A + α) . N, we have

$$(A + \alpha) \cdot N = N$$

so that we may take

$$A + \alpha = 1$$

In any symbolic expression we can therefore replace the operators A or α
by 1 − α, 1 − A, respectively.

Furthermore, since A . (B) = B . (A), we may take the symbol
AB . N to be the dichotomy of N according to both A and B and equate
it to (AB). A little reflection will show that the operative symbols
therefore obey the ordinary laws of algebra and in particular may be
multiplied together.

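For example, we have

$$\begin{aligned}(\alpha\beta) = \alpha\beta \cdot N &= (1 - A)(1 - B) \cdot N \\ &= (1 - A - B + AB) \cdot N \\ &= N - (A) - (B) + (AB)\end{aligned} \tag{1.5}$$

And, similarly,

$$\begin{aligned}(\alpha\beta\gamma) = \alpha\beta\gamma \cdot N &= (1 - A)(1 - B)(1 - C) \cdot N \\ &= (1 - A - B - C + AB + BC + AC - ABC) \cdot N \\ &= N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC)\end{aligned} \tag{1.6}$$

Similar results could, of course, be obtained by step-by-step
substitution ; for instance,

$$\begin{aligned}(\alpha\beta) &= (\alpha) - (\alpha B) \\ &= N - (A) - \{(B) - (AB)\} \\ &= N - (A) - (B) + (AB)\end{aligned}$$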
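
The operator expansion lends itself to a short symbolic sketch. The function below multiplies out a product of factors of the form (1 − A) and prints the resulting signed class-frequencies; the function name and the string format are our own illustrative choices.

```python
from itertools import combinations

def expand_negative_class(absent):
    """Expand the frequency of the wholly negative class on the attributes in `absent`.

    Multiplying out factors (1 - X), one per attribute, gives a term for every
    subset of attributes, with sign + for an even and - for an odd subset size.
    """
    terms = []
    for r in range(len(absent) + 1):
        for combo in combinations(absent, r):
            sign = "-" if r % 2 else "+"
            symbol = "(" + "".join(combo) + ")" if combo else "N"
            terms.append(f"{sign} {symbol}")
    return " ".join(terms).lstrip("+ ")

print(expand_negative_class("AB"))
print(expand_negative_class("ABC"))
```

The two printed lines are `N - (A) - (B) + (AB)` and `N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC)`, the right-hand sides of (1.5) and (1.6).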


Consistence 

1.18 Any class-frequencies which have been or might have been observed 
within one and the same population may be said to be consistent with 
one another. They conform with one another, and do not in any way 
conflict. 

The conditions of consistence are some of them simple, but others are
by no means of an intuitive character. Suppose, for instance, the following
data are given :

N        1000      (AB)        42
(A)       525      (AC)       147
(B)       312      (BC)        86
(C)       470      (ABC)       25

There is nothing obviously wrong with the figures. Yet they are
certainly inconsistent. They might have been observed at different
times, in different places or on different material, but they cannot have
been observed in one and the same population. They imply, in fact, a
negative value for (αβγ) :

$$\begin{aligned}(\alpha\beta\gamma) &= 1000 - 525 - 312 - 470 + 42 + 147 + 86 - 25 \\ &= 1000 - 1307 + 275 - 25 \\ &= -57\end{aligned}$$

Clearly no class-frequency can be negative. If the figures, consequently,
are alleged to be the result of an actual inquiry in a definite
population, there must have been some miscount or misprint. 
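
A check of this kind is easily mechanised: compute all eight ultimate frequencies from the positive ones and look for a negative value. The sketch below uses the figures of 1.18; the helper names and the labelling scheme are our own assumptions.

```python
from itertools import combinations, product

NAMES = "ABC"
GREEK = {"A": "α", "B": "β", "C": "γ"}

# Positive class-frequencies quoted in 1.18; the empty key stands for N.
positive = {"": 1000, "A": 525, "B": 312, "C": 470,
            "AB": 42, "AC": 147, "BC": 86, "ABC": 25}

def ultimate_frequency(present):
    """Alternating sum of positive frequencies over the absent attributes."""
    absent = [letter for letter in NAMES if letter not in present]
    total = 0
    for r in range(len(absent) + 1):
        for extra in combinations(absent, r):
            key = "".join(sorted(set(present) | set(extra)))
            total += (-1) ** r * positive[key]
    return total

ultimate = {}
for flags in product((True, False), repeat=len(NAMES)):
    present = "".join(n for n, f in zip(NAMES, flags) if f)
    label = "(" + "".join(n if f else GREEK[n] for n, f in zip(NAMES, flags)) + ")"
    ultimate[label] = ultimate_frequency(present)

negative = [label for label, freq in ultimate.items() if freq < 0]
print(ultimate)
print(negative)  # the classes that betray the inconsistency
```

Only (αβγ) comes out negative, with the value −57; the other seven ultimate frequencies are positive and the eight still add up to 1000.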
Condition for consistence

1.19 It is, in fact, the necessary and sufficient condition for the
consistence of a set of independent class-frequencies that no ultimate
class-frequency be negative. It is necessary for the obvious reason that no
class-frequency occurring by counting real attributes can be negative ;
it is sufficient because, given any non-negative set of $2^n$ numbers, we can
always imagine a real population with n dichotomies which should have
these numbers for its ultimate class-frequencies, and it is impossible for
this real population to give inconsistent results.

Hence to test the consistence of a set of $2^n$ algebraically independent
class-frequencies we need only calculate the ultimate class-frequencies and
ascertain whether any one is negative. If it is, the data are inconsistent.
If no ultimate frequency is negative, the data are consistent.


1.20 For data given by a heterogeneous collection of class-frequencies,
consistence is best tested by actually calculating the ultimate frequencies.
We saw in 1.15, however, that the positive class-frequencies hold a peculiar
position in that many data encountered in practice are given entirely in
terms of them alone. It may be useful to consider the consistence
conditions for this type of material.

If two attributes are noted there are four ultimate frequencies (AB),
(Aβ), (αB), (αβ). Expressing them in terms of positive classes we find
the following conditions :

$$\left.\begin{aligned}(AB) &\geq 0 \\ (AB) &\geq (A) + (B) - N \\ (AB) &\leq (A) \\ (AB) &\leq (B)\end{aligned}\right\} \tag{1.7}$$

The third and fourth merely express the fact that the number of members
which are both A and B must not be greater than the number of A's or
B's separately. The second inequality is perhaps not so obvious ; it states
that (αβ) = N − (A) − (B) + (AB) cannot be negative.
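
Conditions (1.7) amount to a lower and an upper limit for (AB) once N, (A) and (B) are known. A minimal sketch follows; the function name is ours, and the first set of figures is hypothetical, chosen only to show a case in which the second condition is the binding one.

```python
def ab_limits(n, a, b):
    """Return the (lower, upper) limits to (AB) implied by conditions (1.7)."""
    lower = max(0, a + b - n)   # first and second conditions
    upper = min(a, b)           # third and fourth conditions
    return lower, upper

# Hypothetical figures, not taken from the text.
print(ab_limits(100, 70, 60))    # (30, 60)

# With small class-frequencies the lower limit falls back to zero.
print(ab_limits(1000, 45, 23))   # (0, 23)
```

When (A) + (B) exceeds N some overlap is forced; otherwise the data by themselves do not oblige any individual to possess both attributes.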


1.21 For three attributes the conditions that the eight ultimate
frequencies are not negative will be found to lead to the following :

$$\left.\begin{aligned}(ABC) &\geq 0 \\ (ABC) &\geq (AB) + (AC) - (A) \\ (ABC) &\geq (AB) + (BC) - (B) \\ (ABC) &\geq (AC) + (BC) - (C)\end{aligned}\right\} \tag{1.8}$$

$$\left.\begin{aligned}(ABC) &\leq (AB) \\ (ABC) &\leq (AC) \\ (ABC) &\leq (BC) \\ (ABC) &\leq (AB) + (AC) + (BC) - (A) - (B) - (C) + N\end{aligned}\right\} \tag{1.9}$$

These are not of a new form. They can all be derived from inequalities
(1.7) by "specifying the population" ; that is to say, by considering one
of the inequalities as holding in a sub-population. For instance, from the
condition (AB) ≤ (A) we have in the population of γ's (ABγ) ≤ (Aγ),
which is equivalent to

$$(AB) - (ABC) \leq (A) - (AC)$$

or the second inequality of (1.8).


1.22 If we express the condition that the lower limits to (ABC) given
by (1.8) must be not greater than the upper limits given by (1.9) we
obtain 16 further inequalities. All but four of them are of the type
already found, but there are four new ones :

$$\left.\begin{aligned}(AB) + (AC) + (BC) &\geq (A) + (B) + (C) - N \\ (AB) + (AC) - (BC) &\leq (A) \\ (AB) - (AC) + (BC) &\leq (B) \\ -(AB) + (AC) + (BC) &\leq (C)\end{aligned}\right\} \tag{1.10}$$


Incomplete data 

1.23 We can now take up the question of the inferences which may be 
drawn from data which, though giving us a certain amount of information 
in the shape of class-frequencies, yet are insufficient to enable us to 
calculate all the class-frequencies. 

The form of the consistence conditions shows that a knowledge of 
certain class-frequencies allows us to assign limits to others, even though
we may not be able to find the actual values of those others. The following
will serve as illustrations of the statistical uses of the conditions :

Example 1.3. Given that (A) = (B) = (C) = ½N and 80 per cent of the
A's are B's, 75 per cent of A's are C's, find the limits to the percentage
of B's that are C's.

The data are

$$\frac{2(AB)}{N} = 0.8, \qquad \frac{2(AC)}{N} = 0.75$$

and the conditions (1.10) give

$$\begin{aligned}\text{(a)} \quad \frac{2(BC)}{N} &\geq 1 - 0.8 - 0.75 \\ \text{(b)} \quad \frac{2(BC)}{N} &\geq 0.8 + 0.75 - 1 \\ \text{(c)} \quad \frac{2(BC)}{N} &\leq 1 - 0.8 + 0.75 \\ \text{(d)} \quad \frac{2(BC)}{N} &\leq 1 + 0.8 - 0.75\end{aligned}$$

(a) gives a negative limit and (d) a limit greater than unity ; hence they
may be disregarded. From (b) and (c) we have

$$\frac{2(BC)}{N} \geq 0.55, \qquad \frac{2(BC)}{N} \leq 0.95$$

that is to say, not less than 55 per cent nor more than 95 per cent of
the B's can be C's.
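
The arithmetic of Example 1.3 can be verified with exact fractions. In the sketch below every frequency is expressed as a proportion of N; the variable names are our own, and the extra limits 0 and min((B), (C)) are the trivial bounds on (BC) added for completeness.

```python
from fractions import Fraction

# Proportions of N, from the data of Example 1.3.
a = b = c = Fraction(1, 2)        # (A)/N = (B)/N = (C)/N = 1/2
ab = Fraction(80, 100) * a        # 80 per cent of the A's are B's
ac = Fraction(75, 100) * a        # 75 per cent of the A's are C's

# The four conditions (1.10), each solved for (BC)/N.
lower_limits = [a + b + c - 1 - ab - ac,   # (a)
                ab + ac - a]               # (b)
upper_limits = [b - ab + ac,               # (c)
                c + ab - ac]               # (d)

# (BC) can in any case be neither negative nor greater than (B) or (C).
lower = max(max(lower_limits), 0)
upper = min(min(upper_limits), b, c)

# Express the limits as a share of the B's.
share_lower = lower / b
share_upper = upper / b
print(share_lower, share_upper)   # 11/20 19/20
```

The limits 11/20 and 19/20 are the 55 and 95 per cent found above; conditions (a) and (d) drop out automatically because they are weaker than the trivial bounds.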
Example 1.4. If a report gives the following frequencies as actually
observed, show that there must be a misprint or mistake of some sort, and
that possibly the misprint consists in the dropping of a 1 before the 85
given as the frequency (BC) :

N       1000
(A)      510      (AB)      189
(B)      490      (AC)      140
(C)      427      (BC)       85

From (1.10) we have

$$\begin{aligned}(BC) &\geq 510 + 490 + 427 - 1000 - 189 - 140 \\ &\geq 98\end{aligned}$$

But 85 < 98, therefore it cannot be the correct value of (BC).

If we read 185 for 85 all the conditions are fulfilled.
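
The same test can be written as a small function returning the limits to (BC) that conditions (1.10) impose when N, (A), (B), (C), (AB) and (AC) are given. The function name and argument order are our own; only the figures come from Example 1.4.

```python
def bc_limits(n, a, b, c, ab, ac):
    """Limits to (BC) implied by the four conditions (1.10)."""
    lower = max(a + b + c - n - ab - ac,   # first condition
                ab + ac - a)               # second condition
    upper = min(b - ab + ac,               # third condition
                c + ab - ac)               # fourth condition
    return lower, upper

lower, upper = bc_limits(1000, 510, 490, 427, 189, 140)
print(lower, upper)             # 98 441

print(lower <= 85 <= upper)     # False: the printed value is impossible
print(lower <= 185 <= upper)    # True: 185 satisfies all four conditions
```

The reported 85 falls below the lower limit of 98, whereas 185 lies comfortably within the permitted range.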

Example 1.5. In a certain set of 1000 observations (A) = 45, (B) = 23,
(C) = 14. Show that whatever the percentages of B's that are A's and of
C's that are A's, it cannot be inferred that any B's are C's.

The first two conditions of (1.10) give the lower limit of (BC) which is
required. We find

$$\begin{aligned}\frac{(BC)}{N} &\geq -\frac{(AB)}{N} - \frac{(AC)}{N} - 0.918 \\ \frac{(BC)}{N} &\geq \frac{(AB)}{N} + \frac{(AC)}{N} - 0.045\end{aligned}$$

The first limit is clearly negative. The second must also be negative,
since (AB)/N cannot exceed 0.023 nor (AC)/N, 0.014. Hence we cannot
conclude that there is any limit to (BC) greater than 0. This result is
indeed immediately obvious when we consider that, even if all the B's
were A's, and of the remaining 22 A's 14 were C's, there would still be
8 A's that were neither B's nor C's.


1.24 The student should note the result of the last example, as it 
illustrates the sort of result at which one may often arrive by applying the 
conditions (1.10) to practical statistics. For given values of N, (A), (B),
(C), (AB) and (AC), it will often happen that any value of (BC) not
less than zero will satisfy the conditions (1.10), and hence no true
inference of a lower limit is possible. The argument of the type "So
many A's are B's and so many B's are C's that we must expect some A's
to be C's" must be used with caution.


1.25 Where the data are not given in terms of the positive or of the
ultimate class-frequencies, and cannot readily be thrown into such a
form, the device illustrated in the following example is often useful :
Example 1.6. Among the adult population of a certain town 50 per
cent of the population are male, 60 per cent are wage-earners and 50
per cent are 45 years of age or over. 10 per cent of the males are not
wage-earners and 40 per cent of the males are under 45. Can we infer
anything about what percentage of the population of 45 or over are
wage-earners ?

Denoting the attributes male, wage-earner and 45 years old or more
by A, B and C, respectively, and letting N = 100 for convenience, we
have

$$(A) = 50, \quad (B) = 60, \quad (C) = 50, \quad (A\beta) = 5, \quad (A\gamma) = 20$$

We require the limits, if any, of (BC).

Let us note first of all that we are given 6 class-frequencies (including
N). If we knew two more, independent of these 6, the problem would
be completely determinate, for we should have $2^3$ class-frequencies.

Let us therefore put

$$(\alpha\beta\gamma) = x, \qquad (ABC) = y$$

We can then solve for the ultimate class-frequencies and get

$$\begin{aligned}(AB\gamma) &= 45 - y \\ (A\beta C) &= 30 - y \\ (\alpha BC) &= x - 15 \\ (A\beta\gamma) &= y - 25 \\ (\alpha B\gamma) &= 30 - x \\ (\alpha\beta C) &= 35 - x\end{aligned}$$

The condition that these must be non-negative gives us conditions on x
and y. In fact, from (αBC) and (αBγ) we get

$$15 \leq x \leq 30$$

and from (AβC) and (Aβγ)

$$25 \leq y \leq 30$$

the conditions from the other frequencies being included in these limits
to x and y.

Now

$$(BC) = (ABC) + (\alpha BC) = x + y - 15$$

and hence, from the limits to x and y,

$$25 \leq (BC) \leq 45$$

Consequently, the percentage of the population 45 years old or more
(50 per cent of the total population) who are wage-earners lies between
50 and 90 per cent.

It is worth while examining whether these limits are the narrowest
possible which can be assigned with the available data ; and it is easy to
see that they are. For if x = 15 and y = 25, (BC) = 25 ; and if x = 30 and
y = 30, (BC) = 45. There is nothing in the conditions of the problem to
prevent x and y, and hence (BC), from reaching the limiting values, and
thus no narrowing of the limits is possible.
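
The limits of Example 1.6 can be confirmed by brute force. The sketch below takes N = 100, writes x for (αβγ) and y for (ABC), and, as a simplifying assumption of ours, lets both run over whole numbers only; every pair leaving all eight ultimate frequencies non-negative is kept.

```python
feasible_bc = set()

for x in range(0, 101):        # x = (αβγ)
    for y in range(0, 101):    # y = (ABC)
        # Ultimate frequencies in terms of x and y, from
        # (A) = 50, (B) = 60, (C) = 50, (Aβ) = 5, (Aγ) = 20 and N = 100.
        ultimate = {
            "ABC": y,
            "ABγ": 45 - y,
            "AβC": 30 - y,
            "Aβγ": y - 25,
            "αBC": x - 15,
            "αBγ": 30 - x,
            "αβC": 35 - x,
            "αβγ": x,
        }
        if min(ultimate.values()) >= 0:
            assert sum(ultimate.values()) == 100
            feasible_bc.add(ultimate["ABC"] + ultimate["αBC"])   # (BC)

print(min(feasible_bc), max(feasible_bc))   # 25 45
```

Every whole number from 25 to 45 is attained, so the limits to (BC) cannot be narrowed, and as a share of the 50 persons aged 45 or over they give 50 and 90 per cent.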


SUMMARY

1. A collection of individuals may be divided into two classes according
to whether they do or do not possess a particular attribute. This process
is called dichotomy.

2. Continued dichotomy according to n attributes gives rise to $3^n$
classes.

3. The frequencies in these classes can be expressed in terms of the $2^n$
ultimate class-frequencies, or of the $2^n$ positive class-frequencies.

4. Given $2^n$ independent class-frequencies, all the class-frequencies may
be calculated by simple arithmetical processes.

5. The necessary and sufficient condition for the consistence of a set
of independent class-frequencies relating to a particular population is that
no ultimate class-frequency which may be calculated from them is
negative.

6. In view of the practical importance of the positive class-frequencies,
the form of the consistence conditions is expressed solely in terms of such
frequencies.

7. The conditions may be applied to the examination of inaccurate or
incomplete data. For the latter they may allow us to assign limits to
an unknown class-frequency.


EXERCISES


1.1 The following are the numbers of boys observed with certain classes
of defects amongst a number of school-children. A denotes development
defects ; B, nerve signs ; C, low nutrition.

(ABC)       149      (αBC)       204
(ABγ)       738      (αBγ)     1,762
(AβC)       225      (αβC)       171
(Aβγ)     1,196      (αβγ)    21,842

Find the frequencies of the positive classes.
1.2 The following are the frequencies of the positive classes for the girls
in the same investigation :

N       23,713      (AB)       587
(A)      1,618      (AC)       428
(B)      2,015      (BC)       335
(C)        770      (ABC)      156

Find the frequencies of the ultimate classes.

1.3 (Figures from Census, England and Wales, 1891, vol. 3) Convert
the census statement as below into a statement in terms of (a) the positive,
(b) the ultimate class-frequencies. A = blindness, B = deaf-mutism, C =
mental derangement.

N       29,002,525      (ABγ)       82
(A)         23,467      (AβC)      380
(B)         14,192      (αBC)      500
(C)         97,383      (ABC)       25


1.4 Show that if A occurs in a larger proportion of the cases where 
B is than where B is not, then B will occur in a larger proportion of 
the cases where A is than where A is not : i.e. given (AB)/(B) > (Aβ)/(β),
show that (AB)/(A) > (αB)/(α).

1.5 Given that 

$$(A) = (\alpha) = (B) = (\beta) = \tfrac{1}{2}N$$

show that 

$$(AB) = (\alpha\beta), \qquad (A\beta) = (\alpha B)$$

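1.6 Given that 

$$(A) = (\alpha) = (B) = (\beta) = (C) = (\gamma) = \tfrac{1}{2}N$$

and also that 

$$(ABC) = (\alpha\beta\gamma)$$

show that 

$$2(ABC) = (AB) + (AC) + (BC) - \tfrac{1}{2}N$$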

1.7 Measurements are made on a thousand husbands and a thousand 
wives. If the measurements of the husbands exceed the measurements of 
the wives in 800 cases for one measurement, in 700 cases for another, 
and in 660 cases for both measurements, in how many cases will both 
measurements on the wife exceed the measurements on the husband ? 

1.8 100 children took three examinations. 40 passed the first, 39 passed 
the second and 48 passed the third. 10 passed all three, 21 failed all three, 
9 passed the first two and failed the third, 19 failed the first two and passed 
the third. Find how many children passed at least two examinations. 

Show that for the question asked certain of the given frequencies are 
not necessary. Which are they ? 
Show further that the data are not sufficient to permit of the determination 
of the ultimate class-frequencies. 

1.9 (Lewis Carroll, A Tangled Tale, 1881) In a very hotly fought 
battle 70 per cent at least of the combatants lost an eye, 75 per cent at 
least lost an ear, 80 per cent at least lost an arm and 85 per cent at least 
lost a leg. How many at least must have lost all four ? 

1.10 Show that for n attributes A, B, C, . . . M, 

$$(ABC \ldots M) \geq (A) + (B) + (C) + \ldots + (M) - (n-1)N ;$$

where N is the total frequency ; and hence generalise the result of 
Exercise 1.9. 

1.11 In a free vote in the House of Commons, 600 members voted. 300 
Government members representing English constituencies (including 
Welsh) voted in favour of the motion. 25 Opposition members representing 
Scottish constituencies voted against the motion. The Government 
majority among those who voted was 96. 135 of the members 
voting represented Scottish constituencies. 18 Government members 
voted against the motion. 102 Scottish members voted in favour of the 
motion. The motion was carried by 310 votes. Analyse the voting 
according to the nationality of the constituencies and party. 

1.12 In a war between White and Red forces there are more Red soldiers 
than White ; there are more armed Whites than unarmed Reds ; there 
are fewer armed Reds with ammunition than unarmed Whites without 
ammunition. Show that there are more armed Reds without ammunition 
than unarmed Whites with ammunition. 

1.13 If, in an urban district 817 per thousand of the women between 20 
and 25 years of age were returned as "occupied" at a census, and 263 
per thousand as married or widowed, what is the lowest proportion per 
thousand of the married or widowed that must have been occupied ? 

1.14 If, in a series of houses actually invaded by smallpox, 70 per cent 
of the inhabitants are attacked and 85 per cent have been vaccinated, what 
is the lowest percentage of the vaccinated that must have been attacked ? 

1.15 Given that 50 per cent of the inmates of an institution are men, 
60 per cent are aged (over 60), 80 per cent non-able-bodied, 35 per 
cent aged men, 45 per cent non-able-bodied men, and 42 per cent 
non-able-bodied and aged, find the greatest and least possible proportions of 
non-able-bodied aged men. 

1.16 The following are the proportions per 10,000 of boys observed for 
certain classes of defects amongst a number of school-children. A = 
development defects, B = nerve signs, D = mental dullness. 

N   = 10,000        (D)  = 789 
(A) =    877        (AB) = 338 
(B) =  1,086        (BD) = 455 

Show that some dull boys do not exhibit development defects, and state 
how many at least do not do so. 

1.17 The following are the corresponding figures for girls: 

N   = 10,000        (D)  = 689 
(A) =    682        (AB) = 248 
(B) =    850        (BD) = 363 

Show that some defectively developed girls are not dull, and state how 
many at least must be so. 

1.18 Take the syllogism "All A's are B's, all B's are C's, therefore all 
A's are C's," express the premises in terms of the notation of the preceding 
chapter, and deduce the conclusion by the use of the general conditions 
of consistence. 

1.19 Do the same for the syllogism "All A's are B's, no B's are C's, 
therefore no A's are C's." 


1.20 Given that (A) = (B) = (C) = ½N, and that (AB)/N = (AC)/N = p, 
find what must be the greatest and least values of p in order that we may 
infer that (BC)/N exceeds any given value, say q. 

1.21 Show that if 

$$\frac{(A)}{N} = x, \qquad \frac{(B)}{N} = 2x, \qquad \frac{(C)}{N} = 3x$$

and 

$$\frac{(AB)}{N} = \frac{(AC)}{N} = \frac{(BC)}{N} = y,$$

the value of neither x nor y can exceed 1/4. 

1.22 A market investigator returns the following data. Of 1000 people 
consulted, 811 liked chocolates, 752 liked toffee and 418 liked boiled 
sweets ; 570 liked chocolates and toffee, 356 liked chocolates and boiled 
sweets and 348 liked toffee and boiled sweets ; 297 liked all three. Show 
that this information as it stands must be incorrect. 

1.23 50 per cent of the imports of barley into a country come from the 
Dominions ; 80 per cent of the total imports go to brewing ; 75 per cent 
of the imports are grown in the Northern Hemisphere ; 80 per cent of 
Northern-grown barley goes to brewing ; 100 per cent of foreign 
Southern-grown barley goes to stock-feeding. Show that the foreign 
Northern-grown barley which goes to brewing cannot be less than 30 per cent nor 
more than 50 per cent of the total imports. 

(It is assumed that brewing and stock-feeding are the only two uses to 
which imported barley is put.) 

1.24 A penny is tossed three times and the results, heads and tails, noted. 
The process is continued until there are 100 sets of threes. In 69 cases 
heads fell first, in 49 cases heads fell second, and in 53 cases heads fell 
third. In 33 cases heads fell both first and second, and in 21 cases heads 
fell both second and third. Show that there must have been at least 5 
occasions on which heads fell three times, and that there could not have 
been more than 15 occasions on which tails fell three times, though there 
need not have been any. 



CHAPTER TWO 


ASSOCIATION OF ATTRIBUTES 


Independence 

2.1 If there is no sort of relationship of any kind between two attributes 
A and B, we expect to find the same proportion of A's amongst the B's 
as amongst the not-B's. We may anticipate, for instance, the same 
proportion of abnormally wet seasons in leap years as in ordinary years, 
the same proportion of male to total births when the moon is waxing as 
when it is waning, the same proportion of heads whether a coin be tossed 
with the right hand or the left. 

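Two such unrelated attributes may be termed independent, and we 
have accordingly as the criterion of independence for A and B: 

$$\frac{(AB)}{(B)} = \frac{(A\beta)}{(\beta)} \qquad (2.1)$$

If this relation holds good, the corresponding relations 

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)}, \qquad \frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)}, \qquad \frac{(A\beta)}{(A)} = \frac{(\alpha\beta)}{(\alpha)}$$

must also hold. For it follows at once from (2.1) that 

$$\frac{(B) - (AB)}{(B)} = \frac{(\beta) - (A\beta)}{(\beta)},$$

that is, 

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)},$$

and the other two identities may be similarly deduced. 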

The student may find it easier to grasp the nature of the relations stated 
if the frequencies are supposed grouped into a table with two rows and two 
columns, thus: 

Attribute      B        β       Total 
    A        (AB)     (Aβ)      (A) 
    α        (αB)     (αβ)      (α) 
  Total      (B)      (β)        N 


Equation (2.1) states a certain equality for the columns ; if this holds 
good, the corresponding equation 

$$\frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)}$$

must hold for the rows, and so on. 


Forms of the criterion of independence 

2.2 The criterion may, however, be put into a somewhat different 
and theoretically more convenient form. The equation (2.1) expresses 
(AB) in terms of (B), (β) and a second-order frequency (Aβ) ; eliminating 
this second-order frequency we have: 

$$\frac{(AB)}{(B)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N}$$

i.e. in words, "the proportion of A's amongst the B's is the same as in the 
population at large." The student should learn to recognise this equation 
at sight in any of the forms: 

$$\begin{aligned}
\frac{(AB)}{(B)} &= \frac{(A)}{N} && (a) \\
\frac{(AB)}{(A)} &= \frac{(B)}{N} && (b) \\
(AB) &= \frac{(A)(B)}{N} && (c) \\
\frac{(AB)}{N} &= \frac{(A)}{N} \cdot \frac{(B)}{N} && (d)
\end{aligned} \qquad (2.2)$$

The equation (d) gives the important fundamental rule : If the attributes 
A and B are independent, the proportion of AB's in the population is equal 
to the proportion of A's multiplied by the proportion of B's. 

The advantage of the forms (2.2) over the form (2.1) is that they give 
expressions for the second-order frequency in terms of the frequencies of 
the first order and the whole number of observations alone ; the form 
(2.1) does not. 
Example 2.1. If there are 144 A's and 384 B's in 1024 observations, 
how many AB's will there be, A and B being independent ? 

$$(AB) = \frac{144 \times 384}{1024} = 54$$

There will therefore be 54 AB's. 

Example 2.2. If the A's are 60 per cent, the B's 35 per cent, of the 
whole number of observations, what must be the percentage of AB's in 
order that we may conclude that A and B are independent ? 

$$\frac{60 \times 35}{100} = 21$$

and therefore there must be 21 per cent (more or less closely, cf. 2.8 and 
2.9 below) of AB's in the population to justify the conclusion that A and 
B are independent. 


2.3 It follows from 2.1 that if the relation (2.2) holds for any one of the 
four second-order frequencies, e.g. (AB), similar relations must hold for 
the remaining three. Thus we have directly from (2.1): 

$$\frac{(A\beta)}{(\beta)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N}$$

giving 

$$(A\beta) = \frac{(A)(\beta)}{N}$$

and so on. This is seen at once to be true on consideration of the fourfold 
table on page 20. For if (AB) takes the value (A)(B)/N, (Aβ) must take 
the value (A)(β)/N to keep the total of the row equal to (A), and so 
on for the other rows and columns. The fourfold table in the case of 
independence must in fact have the form: 

Attribute        B             β          Total 
    A        (A)(B)/N      (A)(β)/N       (A) 
    α        (α)(B)/N      (α)(β)/N       (α) 
  Total        (B)           (β)           N 


Example 2.3. In Example 2.1 above, what would be the number of 
αβ's, A and B being independent ? 

(α) = 1024 − 144 = 880 
(β) = 1024 − 384 = 640 

$$(\alpha\beta) = \frac{880 \times 640}{1024} = 550$$
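
The independence table of 2.3 can be filled in mechanically from (A), (B) and N. The Python sketch below does so for the figures of Examples 2.1 and 2.3. The function name `independence_table` and the lower-case keys (`a` for α, `b` for β) are illustrative choices, not notation from the text.

```python
from fractions import Fraction


def independence_table(A, B, N):
    """Second-order frequencies that would hold if A and B were independent."""
    a, b = N - A, N - B          # frequencies of not-A and not-B
    return {
        "AB": Fraction(A * B, N),
        "Ab": Fraction(A * b, N),
        "aB": Fraction(a * B, N),
        "ab": Fraction(a * b, N),
    }


table = independence_table(A=144, B=384, N=1024)
for name, value in table.items():
    print(name, value)
```

Exact fractions are used so that the four cells can be checked to add up to N without rounding error.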
2.4 Finally, the criterion of independence may be expressed in yet a 
third form, viz. in terms of the second-order frequencies alone. If A and 
B are independent, it follows at once from the preceding section that 

$$(AB)(\alpha\beta) = \frac{(A)(B)(\alpha)(\beta)}{N^2}$$

And evidently (αB)(Aβ) is equal to the same fraction. 
Therefore 

$$\begin{aligned}
(AB)(\alpha\beta) &= (\alpha B)(A\beta) && (a) \\
\frac{(AB)}{(\alpha B)} &= \frac{(A\beta)}{(\alpha\beta)} && (b) \\
\frac{(AB)}{(A\beta)} &= \frac{(\alpha B)}{(\alpha\beta)} && (c)
\end{aligned} \qquad (2.3)$$

The equation (b) may be read : "The ratio of A's to α's amongst the 
B's is equal to the ratio of A's to α's amongst the β's," and (c) similarly. 
This form of criterion is a convenient one if all the four second-order 
frequencies are given, enabling one to recognise almost at a glance whether 
or not the two attributes are independent. 

Example 2.4. If the second-order frequencies have the following values, 
are A and B independent or not ? 

(AB) = 110     (αB) = [illegible]     (Aβ) = 290     (αβ) = 510 

Clearly 

$$(AB)(\alpha\beta) > (\alpha B)(A\beta)$$

so A and B are not independent. 


Association 

2.5 Suppose now that A and B are not independent, but related in some 
way or other, however complicated. 

Then if 

$$(AB) > \frac{(A)(B)}{N}$$

A and B are said to be positively associated, or sometimes simply associated. 
If, on the other hand, 

$$(AB) < \frac{(A)(B)}{N}$$

A and B are said to be negatively associated or, more briefly, disassociated. 

The student should carefully note that in statistics the word 
"association" has a technical meaning different from the one current in 
ordinary speech. In common language one speaks of A and B as 
"associated" if they appear together in a number of cases. But in 
statistics A and B are associated only if they appear together in a greater 
number of cases than is to be expected if they are independent. Thus, 
if we consider means of land transport as dichotomised into road and rail 
travel, we may say, in the customary use of the term, that road transport 
is associated with speed. But it does not follow that the two are statisti- 
cally associated, because rail transport may equally be associated with 
speed and, in fact, the attribute speed may be independent of the means 
of travel in these two manners. 

Association, therefore, cannot be inferred from the mere fact that some 
A’s are B's, however great the proportion ; this principle is fundamental 
and should always be borne in mind. 

Complete association and disassociation 

2.6 We have now to consider in what circumstances we may regard 
the association of two attributes as complete. Two courses are open to 
us. Either we may say that for complete association all A's must be 
B’s and all B's must be A’s, in which case it must follow that the A's 
and the B’s occur in the population in equal numbers ; or we may adopt 
a rather wider meaning and say that all A's are B's or all B's are A's, 
according to whether the A's or the B's are in the minority. Similarly, 
complete disassociation may be taken either as the case when no A's are 
B's and no α's are β's, or more widely as the case when either of these 
statements is true. 

We shall adopt the wider definition in the sequel. Thus two attributes 
are completely associated if one of them cannot occur without the other, 
though the other may occur without the one. 

Measurement of intensity of association 

2.7 It follows from the foregoing that if two attributes are completely 
associated, (AB) must be equal to (A) or (B), whichever is the smaller. 
If they are completely disassociated, (AB) must be equal to zero 
or to (A)+(B)−N, whichever is the greater. (AB) must in general lie 
between these two limits. We may thus regard the divergence of (AB) 
from the "independence" value (A)(B)/N towards the limiting value 
in either direction as indicating the intensity of association or disassociation, 
so that we may speak of attributes as being more or less, highly or slightly, 
associated. This conception of degrees of association quantitatively 
expressible is important, and we return in a later section to consider the 
formulae which may be used to measure such degrees. 
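
As a small illustration of these limits, the sketch below computes the least and greatest values (AB) can take for given (A), (B) and N, together with the independence value lying between them. The function name `ab_limits` is an illustrative assumption ; the figures are those of Example 2.1.

```python
def ab_limits(A, B, N):
    """Least and greatest possible values of (AB) for given (A), (B), N."""
    lower = max(0, A + B - N)    # complete disassociation
    upper = min(A, B)            # complete association
    return lower, upper


A, B, N = 144, 384, 1024         # figures of Example 2.1
lower, upper = ab_limits(A, B, N)
independence_value = A * B / N

print(lower, independence_value, upper)
```

When (A)+(B) exceeds N the lower limit is no longer zero, as a case such as (A) = 700, (B) = 600, N = 1000 shows.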

Sampling fluctuations 

2.8 When the association is very slight, i.e. where (AB) differs from 
(A)(B)/N by only a few units or by a small proportion, it may be that 
such association is not really significant of any definite relationship. To 
give an illustration, suppose that a coin is tossed a number of times, and 
the tosses noted in pairs ; then 100 pairs may give such results as the 
following (taken from an actual record): 

First toss heads and second heads  . .  26 
First toss heads and second tails  . .  18 
First toss tails and second heads  . .  27 
First toss tails and second tails  . .  29 

If we use A to denote "heads" in the first toss, B "heads" in 
the second, we have from the above (A) = 44, (B) = 53. Hence 
(A)(B)/N = (44 × 53)/100 = 23.32, while actually (AB) is 26. Hence there is a 
positive association, in the given record, between the result of the first 
throw and the result of the second. But it is fairly certain, from the 
nature of the case, that such association cannot indicate any real 
connection between the results of the two throws ; it must therefore be due 
merely to such a complex system of causes, impossible to analyse, as leads, 
for example, to differences between small samples drawn from the same 
material. The conclusion is confirmed by the fact that, of a number of 
such records, some give a positive association (like the above), but others 
a negative association. 
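
The arithmetic of this illustration may be checked with a few lines of Python. The variable names are illustrative (lower-case `a`, `b` stand for tails on the first and second toss) ; the four frequencies are those of the record quoted above.

```python
from fractions import Fraction

# The four observed frequencies of the coin-tossing record in 2.8.
AB, Ab, aB, ab = 26, 18, 27, 29

N = AB + Ab + aB + ab
A = AB + Ab                       # heads on the first toss
B = AB + aB                       # heads on the second toss

expected = Fraction(A * B, N)     # independence value (A)(B)/N
excess = AB - expected            # positive, so the record shows positive association

print(N, A, B)
print(float(expected), float(excess))
```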

2.9 An event due, like the above occurrence of positive association, to 
an extremely complex system of causes of the general nature of which 
we are aware, but of the detailed operation of which we are ignorant, is 
sometimes said to be due to chance, or better to the chances or fluctuations 
of sampling. 

A little consideration will suggest that such associations due to the 
fluctuations of sampling must be met with in all classes of statistics. To 
quote, for instance, from 2.1, two illustrations there given of independent 
attributes, we know that in any actual record we should not be likely to 
find exactly the same proportion of abnormally wet seasons in leap years 
as in ordinary years, or exactly the same proportion of male births when 
the moon is waxing as when it is waning. But so long as the divergence 
from independence is not well marked we must regard such attributes 
as practically independent, or dependence as at least unproved. 

The discussion of the question, how great the divergence must be 
before we can consider it as "well marked," must be postponed to the 
chapters dealing with the theory of sampling. At present the attention 
of the student can only be directed to the existence of the difficulty, and 
to the serious risk of interpreting a "chance association" as physically 
significant. 

The choice of a suitable form for testing association 

2.10 The definition of 2.5 suggests that we are to test the existence 
or the intensity of association between two attributes by a comparison 
of the actual value of (AB) with its independence value (as it may be 
termed) (A)(B)/N. The procedure is from the theoretical standpoint 
perhaps the most natural, but it is more usual, and is simplest and best 
in practice, to compare proportions, e.g. the proportion of A's amongst the 
B's with the proportion amongst the β's. Such proportions are usually 
expressed in the form of percentages or proportions per thousand. 

It will be evident from 2.1 and 2.2 that a large number of such com- 
parisons are available for the purpose, and the question arises, therefore, 
which is the best comparison to adopt ? 


2.11 Two principles should decide this point : (1) of any two comparisons, 
that is the better which brings out the more clearly the degree of associa- 
tion ; (2) of any two comparisons, that is the better which illustrates the 
more important aspect of the problem under discussion. 

The first condition at once suggests that comparisons of the form 

$$\frac{(AB)}{(B)} \ \text{with} \ \frac{(A\beta)}{(\beta)} \qquad (2.4)$$

are better than comparisons of the form 

$$\frac{(AB)}{(B)} \ \text{with} \ \frac{(A)}{N} \qquad (2.5)$$

For it is evident that if most of the objects or individuals in the population 
are B's, i.e. if (B)/N approaches unity, (AB)/(B) will necessarily approach 
(A)/N even though the difference between (AB)/(B) and (Aβ)/(β) is 
considerable. The second form of comparison may therefore be misleading. 

Setting aside, then, comparisons of the general form (2.5), the question 
remains whether to apply the comparison of the form (2.4) to the rows or 
the columns of the table, if the data are tabulated as on page 21. This 
question must be decided with reference to the second principle, i.e. with 
regard to the more important aspect of the problem under discussion, 
the exact question to be answered, or the hypothesis to be tested, as 
illustrated by the examples below. Where no definite question has to be 
answered or hypothesis tested both pairs of proportions may be tabulated. 

Example 2.5. Association between inoculation against cholera and 
exemption from attack. (Data from Greenwood and Yule, Proc. Roy. 
Soc. Med., 1915, 8, 221, Table III). 

                    Not attacked    Attacked    Total 
Inoculated               276            3         279 
Not inoculated           473           66         539 
Total                    749           69         818 
Here the important question is, How far does inoculation protect from 
attack ? The most natural comparison is therefore: 

Percentage of inoculated who were not attacked      . .  98.9 
Percentage of not inoculated who were not attacked  . .  87.8 

Or we might tabulate the complementary proportions: 

Percentage of inoculated who were attacked          . .   1.1 
Percentage of not inoculated who were attacked      . .  12.2 

Either comparison brings out simply and clearly the fact that inoculation 
and exemption from attack are positively associated (inoculation and 
attack negatively associated). 

We are making above a comparison by rows in the notation of the table 
on page 21, comparing (AB)/(A) with (αB)/(α), or (Aβ)/(A) with (αβ)/(α). 
A comparison by columns, e.g. (AB)/(B) with (Aβ)/(β), would serve 
equally to indicate whether there was any appreciable association, but 
would not answer directly the particular question we have in mind: 

Percentage of not-attacked who were inoculated      . .  36.8 
Percentage of attacked who were inoculated          . .   4.3 
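
The row and column comparisons of Example 2.5 can be reproduced with a short Python sketch. The function name `percentages` and the dictionary labels are illustrative assumptions ; A is taken as "inoculated" and B as "not attacked".

```python
def percentages(table):
    """Row and column percentages of a fourfold table.

    `table` is ((AB, Ab), (aB, ab)): rows are A / not-A, columns B / not-B.
    """
    (AB, Ab), (aB, ab) = table
    by_rows = {
        "B among A": 100 * AB / (AB + Ab),
        "B among not-A": 100 * aB / (aB + ab),
    }
    by_columns = {
        "A among B": 100 * AB / (AB + aB),
        "A among not-B": 100 * Ab / (Ab + ab),
    }
    return by_rows, by_columns


# Example 2.5: A = inoculated, B = not attacked.
by_rows, by_columns = percentages(((276, 3), (473, 66)))
for label, value in {**by_rows, **by_columns}.items():
    print(label, round(value, 1))
```

The comparison by rows answers the question of protection directly ; the comparison by columns shows the same association from the other side.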


Example 2.6. Eye-colour of father and son (material due to Galton, 
as given by Pearson, Phil. Trans., A, 1900, 195, 138 ; the classes 1, 2 and 
3 of the memoir treated as "light"). 

Fathers with light eyes and sons with light eyes (AB)          . .  471 
Fathers with light eyes and sons with not light eyes (Aβ)      . .  151 
Fathers with not light eyes and sons with light eyes (αB)      . .  148 
Fathers with not light eyes and sons with not light eyes (αβ)  . .  230 

Required to find whether the colour of the son's eyes is associated with 
that of the father's. In cases of this kind the father is reckoned once for 
each son ; e.g. a family in which the father was light-eyed, two sons 
light-eyed and one not, would be reckoned as giving two to the class AB and 
one to the class Aβ. 

The best comparison here is: 

Percentage of light-eyed amongst the sons of light-eyed fathers      . .  76 per cent 
Percentage of light-eyed amongst the sons of not-light-eyed fathers  . .  39 per cent 

But the following is equally valid: 

Percentage of light-eyed amongst the fathers of light-eyed sons      . .  76 per cent 
Percentage of light-eyed amongst the fathers of not-light-eyed sons  . .  40 per cent 
The reason why the former comparison is preferable is that we usually 
wish to estimate the character of offspring from that of the parents, and 
not vice versa. Both modes of statement, however, indicate equally 
clearly that there is considerable resemblance between father and son. 

Example 2.7. Association between inoculation against cholera and 
exemption from attack, five separate epidemics (cf. Example 2.5 ; data 
from Tables IX, X, XXVIII, XXIX, XXXI of the paper there cited). 



                    Not attacked    Attacked    Total 
Inoculated               192            4         196 
Not inoculated           113           34         147 
Total                    305           38         343 


                    Not attacked    Attacked    Total 
Inoculated              5,751          27        5,778 
Not inoculated          6,351         198        6,549 
Total                  12,102         225       12,327 


                    Not attacked    Attacked     Total 
Inoculated              4,087            5        4,092 
Not inoculated        113,856        1,144      115,000 
Total                 117,943        1,149      119,092 


                    Not attacked    Attacked    Total 
Inoculated              8,332           8        8,340 
Not inoculated         84,444         556       85,000 
Total                  92,776         564       93,340 


                    Not attacked    Attacked     Total 
Inoculated              4,870           5         4,875 
Not inoculated        153,096         904       154,000 
Total                 157,966         909       158,875 


With the table of Example 2.5 the above give data for six separate 
epidemics, in all of which the same method of inoculation appears to have 
been used : the data refer to natives only, and the numbers of observations 
are sufficiently large to reduce "fluctuations of sampling" within 
reasonably narrow limits. The proportions not attacked are as follows: 
                 Proportion not attacked 
Epidemic     Not inoculated    Inoculated    Difference 
    1            0.8776          0.9892        0.1116 
    2            0.7687          0.9796        0.2109 
    3            0.9698          0.9953        0.0255 
    4            0.9901          0.9988        0.0087 
    5            0.9935          0.9990        0.0055 
    6            0.9941          0.9990        0.0049 


In each case inoculation and exemption from attack are positively 
associated, but it will be seen that the several proportions, and the 
differences between them, vary considerably. Evidently in a very mild 
epidemic this difference can only be small, and the question arises how 
far the data for the separate epidemics can be said to be consistent in 
their indication of the "efficiency" of the inoculation. This is not a 
simple question to answer ; the more advanced student is referred to the 
discussion in the original. 
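
The proportions for the six epidemics can be recomputed from the tables of Examples 2.5 and 2.7, as in the following Python sketch. The data layout and the name `proportion_not_attacked` are illustrative assumptions. The differences are formed from the rounded proportions, which appears to be how the tabulated differences were obtained.

```python
# (not attacked, attacked) for inoculated and not inoculated, epidemics 1 to 6.
epidemics = [
    ((276, 3), (473, 66)),
    ((192, 4), (113, 34)),
    ((5751, 27), (6351, 198)),
    ((4087, 5), (113856, 1144)),
    ((8332, 8), (84444, 556)),
    ((4870, 5), (153096, 904)),
]


def proportion_not_attacked(not_attacked, attacked):
    """Proportion not attacked, rounded to four decimal places."""
    return round(not_attacked / (not_attacked + attacked), 4)


results = []
for inoculated, not_inoculated in epidemics:
    p_inoc = proportion_not_attacked(*inoculated)
    p_not = proportion_not_attacked(*not_inoculated)
    # Difference of the rounded proportions (an assumption about the source's method).
    results.append((p_not, p_inoc, round(p_inoc - p_not, 4)))

for number, row in enumerate(results, start=1):
    print(number, *row)
```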

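The symbols (AB)₀ and δ 

2.12 The values that the four second-order frequencies take in the 
case of independence, viz. 

$$\frac{(A)(B)}{N}, \qquad \frac{(\alpha)(B)}{N}, \qquad \frac{(A)(\beta)}{N}, \qquad \frac{(\alpha)(\beta)}{N}$$

are of such great theoretical importance, and of so much use as 
reference-values for comparing with the actual values of the frequencies (AB), (αB), 
(Aβ) and (αβ), that it is often desirable to employ single symbols to denote 
them. We shall use the symbols 

$$(AB)_0 = \frac{(A)(B)}{N}, \qquad (\alpha B)_0 = \frac{(\alpha)(B)}{N}, \qquad (A\beta)_0 = \frac{(A)(\beta)}{N}, \qquad (\alpha\beta)_0 = \frac{(\alpha)(\beta)}{N}$$

If δ denote the excess of (AB) over (AB)₀, then, in order to keep the totals 
of rows and columns constant, the general table (cf. the table for the case 
of independence on page 21) must be of the form: 

Attribute         B              β          Total 
    A         (AB)₀ + δ      (Aβ)₀ − δ       (A) 
    α         (αB)₀ − δ      (αβ)₀ + δ       (α) 
  Total          (B)            (β)           N 

Therefore, quite generally we have: 

$$\delta = (AB) - (AB)_0 = (\alpha\beta) - (\alpha\beta)_0 = (A\beta)_0 - (A\beta) = (\alpha B)_0 - (\alpha B)$$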
2.13 The value of this common difference δ may be expressed in a form 
that is useful to note. We have by definition: 

$$\delta = (AB) - (AB)_0 = (AB) - \frac{(A)(B)}{N}$$

Bring the terms on the right to a common denominator, and express all 
the frequencies of the numerator in terms of those of the second order ; 
then we have: 

$$\begin{aligned}
\delta &= \frac{1}{N}\Big\{(AB)\big[(AB) + (\alpha B) + (A\beta) + (\alpha\beta)\big] - \big[(AB) + (A\beta)\big]\big[(AB) + (\alpha B)\big]\Big\} \\
&= \frac{1}{N}\Big\{(AB)(\alpha\beta) - (\alpha B)(A\beta)\Big\}
\end{aligned}$$

That is to say, the common difference is equal to 1/Nth of the difference 
of the "cross-products" (AB)(αβ) and (αB)(Aβ). 

It is evident that the difference of the cross-products may be very 
large if N be large, although δ is really very small. In using the difference 
of the cross-products to test mentally the sign of the association in a case 
where all the four second-order frequencies are given, this should be 
remembered ; the difference should be compared with N, or it will be 
liable to suggest a higher degree of association than actually exists. 

Example 2.8. The following data were observed for hybrids of Datura 
(Bateson and Saunders, Report to the Evolution Committee of the Royal 
Society, 1902): 

Flowers violet, fruits prickly (AB)  . .  47 
Flowers violet, fruits smooth (Aβ)   . .  12 
Flowers white, fruits prickly (αB)   . .  21 
Flowers white, fruits smooth (αβ)    . .   3 

Investigate the association between colour of flower and character of 
fruit. 

Since 3 × 47 = 141, 12 × 21 = 252, i.e. (AB)(αβ) < (αB)(Aβ), there is 
clearly a negative association ; 252 − 141 = 111, and at first sight this 
considerable difference is apt to suggest a considerable disassociation. But 
δ is numerically only 111/83 = 1.3, and forms a small proportion of the frequency, so 
that in point of fact the disassociation is small, so small that no stress can 
be laid on it as indicating anything but a fluctuation of sampling. Working 
out the percentages we have: 

Percentage of violet-flowered plants with prickly fruits  . .  80 per cent 
Percentage of white-flowered plants with prickly fruits   . .  87 per cent 
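
The identity of 2.13 can be checked numerically on the Datura figures. In the Python sketch below, `delta` uses the cross-products and `delta_from_definition` uses (AB) − (A)(B)/N ; both names are illustrative, and lower-case `a`, `b` stand for α, β.

```python
from fractions import Fraction


def delta(AB, Ab, aB, ab):
    """Common difference: difference of the cross-products divided by N."""
    N = AB + Ab + aB + ab
    return Fraction(AB * ab - aB * Ab, N)


def delta_from_definition(AB, Ab, aB, ab):
    """(AB) minus its independence value (A)(B)/N."""
    N = AB + Ab + aB + ab
    A, B = AB + Ab, AB + aB
    return AB - Fraction(A * B, N)


datura = dict(AB=47, Ab=12, aB=21, ab=3)
d = delta(**datura)
print(d, float(d))
```

The sign of the result is negative, in agreement with the disassociation noted in the example ; the magnitude is 111/83.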
Coefficient of association 

2.14 In the previous examples we have judged the association by 
comparing the class-frequencies with those which would exist if the data 
were given by independent attributes, and we can form a rough idea of 
the strength of the association by examining the extent of the difference. 
This is sufficient for almost all practical purposes, although, if the data 
are likely to be affected seriously by fluctuations of random sampling, 
some test of the significance of the difference is also necessary. Apart 
from this question, however, it is sometimes convenient to measure the 
intensities of the associations by means of a coefficient. 

It is clearly convenient if such a coefficient can be devised as to be 
zero if the attributes are independent, +1 if they are completely associated, 
and −1 if they are completely disassociated. 

2.15 Many such coefficients may be devised, but perhaps the simplest 
possible (though not necessarily the most advantageous) is the expression: 

$$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)} = \frac{N\delta}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

where δ is the symbol used in 2.12 and 2.13 for the difference (AB) − 
(AB)₀. It is evident that Q is zero when the attributes are independent, 
for then δ is zero : it takes the value +1 when there is complete association, 
for then the second term in both numerator and denominator of the 
first form of the expression is zero : similarly it is −1 where there is 
complete disassociation, for then the first term in both numerator and 
denominator is zero. Q may accordingly be termed a coefficient of 
association. As illustrations of the values it will take in certain cases, 
the association between light eye-colour in father and in son (Example 2.6) 
is +0.66 ; between colour of flower and prickliness of fruit in Datura 
(Example 2.8), −0.28 : a disassociation which, however, as already 
stated, is probably of no practical significance and due to mere fluctuations 
of sampling. 

The student should note that if all the terms containing A are multiplied 
by a constant, the value of Q is unaltered. Similarly for α, B and β. 
Hence Q is independent of the relative proportions of A's and α's in the 
data. This property is important, and renders such a measure of association 
specially adapted to cases in which the proportions are arbitrary 
(e.g. experiments). 
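
A minimal Python sketch of the coefficient Q follows, with a check of the invariance property just described. The function name is illustrative. For Example 2.6 the frequency (Aβ) is taken as 151, the value that reproduces the 76 and 40 per cent quoted in that example ; this is an assumption about the intended figure.

```python
def coefficient_of_association(AB, Ab, aB, ab):
    """Q: difference of the cross-products divided by their sum."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)


q_eyes = coefficient_of_association(AB=471, Ab=151, aB=148, ab=230)
q_datura = coefficient_of_association(AB=47, Ab=12, aB=21, ab=3)
print(round(q_eyes, 2), round(q_datura, 2))

# Multiplying every frequency containing A by a constant leaves Q unaltered.
q_scaled = coefficient_of_association(AB=10 * 47, Ab=10 * 12, aB=21, ab=3)
print(abs(q_scaled - q_datura) < 1e-12)
```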
2.16 Another coefficient which has the same property is the coefficient 
of colligation, 

$$Y = \frac{1 - \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}{1 + \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}} \qquad (2.6)$$

It is easy to show that 

$$Q = \frac{2Y}{1 + Y^2} \qquad (2.7)$$
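
The relation between Q and Y may be verified numerically. The sketch below computes both coefficients for the Datura data of Example 2.8 ; the function names are illustrative, and the formula for Y assumes that (AB) and (αβ) are both greater than zero.

```python
import math


def colligation_y(AB, Ab, aB, ab):
    """Coefficient of colligation Y; requires AB * ab > 0."""
    r = math.sqrt((Ab * aB) / (AB * ab))
    return (1 - r) / (1 + r)


def association_q(AB, Ab, aB, ab):
    """Coefficient of association Q."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)


datura = dict(AB=47, Ab=12, aB=21, ab=3)
y = colligation_y(**datura)
q = association_q(**datura)

# The two sides of the relation Q = 2Y / (1 + Y^2) agree to rounding error.
print(round(y, 3), round(q, 3), round(2 * y / (1 + y * y), 3))
```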


Association in sub-populations 

2.17 Up to this point we have considered association between two 
attributes in a population without regard to whether any information 
existed about other attributes in the population. If, however, such 
information does exist and, say, we can find the frequency-classes of 
attributes C, D, etc., the question arises, What are the associations of 
A and B in the sub-populations C, γ, CD, etc. ? 

Thus, if A = standard of health and B = consumption of food, the foregoing 
discussion would enable us to examine whether health and food-consumption 
were associated in any particular population, say the population 
of Great Britain. But we might want to go further than this and 
examine the association between A and B among males, or among the 
poorer classes, and compare it with the association among females or among 
the well-to-do classes, respectively. Defining C = males and D = poor, this 
amounts to examining the associations of A and B in the populations C, γ, 
D and δ. 

2.18 Associations of this kind are of the utmost importance in statistical 
practice. As instances of the ways in which they arise let us consider the 
following two illustrations: 

(1) Suppose that we have established, in the manner of foregoing 
sections, a positive association between inoculation and exemption from 
smallpox in a population of persons. It is natural to infer that this association 
is due to some causal relation between the two attributes and may be 
expected to recur in the future ; in short, that smallpox is prevented by 
vaccination. 

This rather hasty conclusion might, however, meet an opponent who 
argues in this way : vaccination is accepted among the well-to-do classes, 
but is looked on with suspicion by the lower classes. For this and other 
reasons most of the unvaccinated persons are drawn from the lower classes. 
But these are precisely the people whom, from the unhygienic conditions 
under which they live, one would expect to be exposed to infection and 
who, moreover, being malnourished, would be more likely to contract 
disease when they were infected. Hence the comparative exemption of 
the vaccinated persons is not due to the fact that they have been vaccinated, 
but to the fact that they belong to the well-to-do classes. It is, as it were, 
an accident that these people also happen to be from a class which favours 
vaccination. 

Denoting vaccination by A, exemption from attack by B and hygienic 
conditions by C, this argument amounts to saying that the observed 
association between A and B is not of itself causally direct, but is due to 
the associations of both A and B with C. 

Now it is clear that this objection could not be lodged if the hygienic
conditions among all the members of the population were the same. If,
therefore, we examine the association of A and B in the sub-population C
and still find an association, the supposed argument will be refuted. We
are thus led to a consideration of the association in that sub-population.

(2) As a second example, suppose that an association is noted between 
the presence of an attribute in the father and the presence in the son, and 
also between the presence in the grandfather and the presence in the grand- 
son. The question which arises here is : Does the resemblance between 
grandfather and grandson arise from a kind of hereditary transmission 
which may, in the common phrase, " skip a generation,” or is it merely 
due to the fact that the grandfather is like the father and the father is like 
the son ? 

Denoting the presence of the attribute in the son, father and
grandfather by A, B and C, the question is: Is the association between A
and C due to associations between A and B, and B and C?

If the association between A and C is observed among all the cases in
which the father possesses the attribute or all those in which he does not,
and is still sensible, clearly the association between A and C cannot be due
to associations between A and B, B and C; hence, as before, to resolve
the question we are led to consider the association between A and C in the
sub-populations B and β.

2.19 Generally, ambiguity of the type to which we have just referred 
arises from the fact that the population under discussion contains not 
merely objects possessing the third attribute alone, but a mixture of 
objects with and without it. To meet the requirements of the discussion 
we have to consider the associations in sub-populations wherein this attri- 
bute is entirely absent or entirely present. By this means we can go 
deeper into the nature of the underlying causes and eliminate certain 
possible explanations of the type : an association between A and B does 
not mean that the two are directly related, but only that each is associated 
with a third attribute C. 

Partial associations

2.20 The associations between A and B in sub-populations are called
partial associations, to distinguish them from the total associations between
A and B in the population at large.

As for total association, A and B are said to be positively associated
in the population of C's if

$$(ABC) > \frac{(AC)(BC)}{(C)} \tag{2.8}$$

and negatively associated in the converse case.

Similarly they are positively associated in the population of CD's if

$$(ABCD) > \frac{(ACD)(BCD)}{(CD)} \tag{2.9}$$

and so on. These formulae are derived from the formula for total associa- 
tion by specifying the population in which the partial association exists. 

Alternative forms of the conditions for partial association 
2.21 As in the case of total association, the above forms can be written 
in many ways, adapted to the nature of the data and of the question 
which is to be answered. The partial association is most conveniently 
tested by comparisons of percentages or proportions in the manner of 2.2| 
and we may quote the four most convenient comparisons in the case 
of three attributes — 


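$$
\left.
\begin{aligned}
\frac{(ABC)}{(BC)} &> \frac{(AC)}{(C)} && (a) \\
\frac{(ABC)}{(BC)} &> \frac{(A\beta C)}{(\beta C)} && (b) \\
\frac{(ABC)}{(AC)} &> \frac{(BC)}{(C)} && (c) \\
\frac{(ABC)}{(AC)} &> \frac{(\alpha BC)}{(\alpha C)} && (d)
\end{aligned}
\right\} \tag{2.10}
$$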


Similar formulae may be written down for the cases of four or more 
attributes, and the methods of this chapter are applicable to such cases. 
For the sake of simplicity we shall, however, confine ourselves to three 
attributes hereafter. 

Example 2.9. The following are the proportions per 10,000 of boys
observed with certain classes of defects amongst a number of
school-children. (A) denotes the number with development defects, (B) the
number with nerve signs, (D) the number of the "dull."


    N       10,000        (AB)      338
    (A)        877        (AD)      338
    (B)      1,086        (BD)      455
    (D)        789        (ABD)     153


The Report (referred to in Example 1.1) from which the figures are drawn 
concludes that “ the connecting link between defects of body and mental 
dullness is the coincident defect of brain which may be known by observa- 
tion of abnormal nerve signs.” Discuss this conclusion. 

The phrase "connecting link" is a little vague, but it may mean that
the mental defects indicated by nerve signs B may give rise to
development defects A, and also to mental dullness D; A and D being thus
common effects of the same cause B (or another attribute necessarily
indicated by B) and not directly influencing each other. The case is
thus similar to that of the first illustration of 2.18 (liability to smallpox
and to non-vaccination being held to be common effects of the same
circumstances), and may be similarly treated by investigation of the
partial associations between A and D for the populations B and β. As the
ratios (A)/N, (B)/N, (D)/N are small, comparisons of the form (2.10),
(a) and (b) above, may be used.

The following figures illustrate, then, the association between A and D
for the whole population, the B-population and the β-population:

For the entire material:

    Proportion of the dull
        = (D)/N = 789/10,000                        =  7.9 per cent
    Proportion of the defectively developed who
        were dull = (AD)/(A) = 338/877              = 38.5 per cent

For those exhibiting nerve signs:

    Proportion of the dull
        = (BD)/(B) = 455/1,086                      = 41.9 per cent
    Proportion of the defectively developed who
        were dull = (ABD)/(AB) = 153/338            = 45.3 per cent

For those not exhibiting nerve signs:

    Proportion of the dull
        = (βD)/(β) = 334/8,914                      =  3.7 per cent
    Proportion of the defectively developed who
        were dull = (AβD)/(Aβ) = 185/539            = 34.3 per cent


The results are extremely striking: the association between A and D
is high both for the material as a whole (the population at large) and for
those not exhibiting nerve signs (the β-population), but it is small for those
who do exhibit nerve signs (the B-population).

This result does not appear to be in accord with the conclusion of the
Report, as we have interpreted it, for the association between A and D
in the β-population should in that case have been low instead of high.
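
The percentages of Example 2.9 can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names and the helper `pct` are our own, and the frequencies for the β-population are obtained by subtraction from the tabulated figures, as in the text.

```python
# Class-frequencies per 10,000 boys, taken from Example 2.9.
N, A, B, D = 10_000, 877, 1_086, 789
AB, AD, BD, ABD = 338, 338, 455, 153

def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Frequencies in the beta-population (no nerve signs), found by subtraction.
beta = N - B          # (beta)
A_beta = A - AB       # (A beta)
beta_D = D - BD       # (beta D)
A_beta_D = AD - ABD   # (A beta D)

# For each population: (per cent dull overall, per cent dull among the A's).
comparisons = {
    "whole population": (pct(D, N), pct(AD, A)),
    "B (nerve signs)": (pct(BD, B), pct(ABD, AB)),
    "beta (no nerve signs)": (pct(beta_D, beta), pct(A_beta_D, A_beta)),
}
for name, (dull_all, dull_among_A) in comparisons.items():
    print(f"{name}: {dull_all}% dull overall, {dull_among_A}% dull among A")
```

The gap between the two percentages is wide in the whole population and in the β-population, and narrow in the B-population, which is the contrast on which the discussion above turns.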


Notation for partial associations

2.22 We now introduce a notation which is analogous to that used
for total associations. It will be remembered that in 2.13 we wrote:

$$(AB)_0 = \frac{(A)(B)}{N}, \qquad \delta = (AB) - (AB)_0$$

We now write:

$$(AB.C)_0 = \frac{(AC)(BC)}{(C)}, \qquad (AB.CD)_0 = \frac{(ACD)(BCD)}{(CD)},$$

$$\delta_{AB.C} = (ABC) - (AB.C)_0, \qquad \delta_{AB.CD} = (ABCD) - (AB.CD)_0, \quad \text{etc.}$$

The δ-numbers measure the divergence of the actual frequencies from
those which would exist if the attributes were independent in the
sub-population under discussion.

It is also possible to generalise the coefficient of association Q by defining
partial coefficients of the type

$$Q_{AB.C} = \frac{(ABC)(\alpha\beta C) - (A\beta C)(\alpha BC)}{(ABC)(\alpha\beta C) + (A\beta C)(\alpha BC)}$$

The student will notice that the formulae for the δ-numbers and for
the Q numbers are obtained from the expressions for total association by
specifying the population in which the partial association is to be
considered. They need not therefore be memorised.
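
As an illustration of the partial coefficient, the sketch below evaluates Q for the attributes A and D of Example 2.9 within the B-population and within the β-population. The function name `q_coefficient` and the variable names are our own; the four ultimate frequencies of each partial table are obtained by subtraction from the figures given in that example.

```python
def q_coefficient(ab, a_beta, alpha_b, alpha_beta):
    """Coefficient of association Q from the four cells of a 2x2 table."""
    cross1 = ab * alpha_beta
    cross2 = a_beta * alpha_b
    return (cross1 - cross2) / (cross1 + cross2)

# Example 2.9 figures (per 10,000 boys).
N, A, B, D = 10_000, 877, 1_086, 789
AB, AD, BD, ABD = 338, 338, 455, 153

# Partial table of A and D within the B-population.
q_in_B = q_coefficient(
    ABD,                # (ABD)
    AB - ABD,           # (AB delta)
    BD - ABD,           # (alpha B D)
    B - AB - BD + ABD,  # (alpha B delta)
)

# Partial table of A and D within the beta-population.
beta, A_beta, beta_D, A_beta_D = N - B, A - AB, D - BD, AD - ABD
q_in_beta = q_coefficient(
    A_beta_D,                           # (A beta D)
    A_beta - A_beta_D,                  # (A beta delta)
    beta_D - A_beta_D,                  # (alpha beta D)
    beta - A_beta - beta_D + A_beta_D,  # (alpha beta delta)
)
print(round(q_in_B, 3), round(q_in_beta, 3))
```

The coefficient is close to 0.1 among those with nerve signs and above 0.9 among those without, in agreement with the comparison of percentages made in Example 2.9.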


Number of partial associations 

2.23 For three attributes A, B, C there are three total associations,
namely, those of A with B, B with C and C with A; and six partial
associations, namely, those of A and B in C and γ, B and C in A and α,
and C and A in B and β.

For four attributes there are fifty-four associations; for we can choose
two attributes from four in six ways, and there are nine associations for
each pair (one total, four partials in the sub-populations specified by one
attribute, and four partials in the sub-populations specified by two).

We state without proof that for n attributes there are

$$\frac{n(n-1)}{2}\, 3^{n-2}$$

associations. Of these, n(n-1)/2 are total and the remainder partial. For
n > 4 this number is so large as to be almost unmanageable. For instance,
if n = 5 it is 270, and if n = 6 it is 1215.
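
The count can be checked by direct enumeration. The sketch below (the function name `count_associations` is our own) picks each pair of attributes in turn and lets every remaining attribute be either unspecified, required present, or required absent in the sub-population; the single choice with nothing specified is the total association.

```python
from itertools import combinations, product
from math import comb

def count_associations(n):
    """Return (number of total, number of partial) associations for n attributes."""
    total = partial = 0
    for pair in combinations(range(n), 2):
        others = [k for k in range(n) if k not in pair]
        # Each other attribute is ignored (0), present (+1) or absent (-1).
        for spec in product((0, 1, -1), repeat=len(others)):
            if any(spec):
                partial += 1
            else:
                total += 1
    return total, partial

for n in range(2, 7):
    total, partial = count_associations(n)
    assert total == comb(n, 2)
    assert total + partial == comb(n, 2) * 3 ** (n - 2)
    print(n, total, partial, total + partial)
```

Each remaining attribute contributes a factor of three, which is where the power of 3 in the formula comes from.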

The large number of partial associations which exists might be thought
to occasion some difficulty. We may, however, reassure ourselves by
two considerations.

In the first place, it is rarely necessary to investigate in any practical
instance all the partial associations which are theoretically possible. For
instance, in Example 2.9 the total and partial associations between A
and D were alone investigated; those between A and B, B and D were
not essential for answering the question which was asked.

Relations between partial associations

2.24 In the second place, a theoretical discussion of the partial
associations is assisted by the following result: The $\frac{n(n-1)}{2}\,3^{n-2}$
associations are all expressible in terms of $2^n - (n+1)$ algebraically
independent associations, together with the class-frequencies N, (A), (B),
(C), etc.

In fact, we saw in Chapter 1 that all the class-frequencies can be
expressed in terms of the positive class-frequencies, which are $2^n$ in
number in the case of n attributes. Hence the frequencies N, (A), (B),
(C), etc., of which there are (n+1), together with the $2^n - (n+1)$ other
positive frequencies, completely determine the data, and hence determine
the associations, which are expressed in terms of the data. Hence the
number of algebraically independent associations which can be derived
is only $2^n - (n+1)$.

2.25 In practice the existence of these relations is of little or no value. 
The formal relations between the ratios and the δ-numbers which express
the associations are, in fact, so complex that lengthy algebraic manipula- 
tion is necessary to express those which are not known in terms of those 
which are. It is usually better to evaluate the class-frequencies and 
calculate the desired results directly from them. 


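2.26 There is, however, one result which has important theoretical
consequences.

We have, by definition,

$$\delta_{AB.C} = (ABC) - \frac{(AC)(BC)}{(C)}, \qquad
\delta_{AB.\gamma} = (AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}$$

Hence, since (ABC) + (ABγ) = (AB), (Aγ) = (A) - (AC), (Bγ) = (B) - (BC)
and (γ) = N - (C),

$$
\begin{aligned}
\delta_{AB.C} + \delta_{AB.\gamma}
&= (AB) - \frac{(AC)(BC)}{(C)} - \frac{(A\gamma)(B\gamma)}{(\gamma)} \\
&= (AB) - \frac{1}{(C)(\gamma)}\Bigl\{ N(AC)(BC) - (A)(C)(BC) - (B)(C)(AC) + (A)(B)(C) \Bigr\} \\
&= (AB) - \frac{(A)(B)}{N} - \frac{N}{(C)(\gamma)}\Bigl\{ (AC) - \frac{(A)(C)}{N} \Bigr\}\Bigl\{ (BC) - \frac{(B)(C)}{N} \Bigr\} \\
&= \delta_{AB} - \frac{N}{(C)(\gamma)}\,\delta_{AC}\,\delta_{BC}
\end{aligned}
\tag{2.13}
$$

This gives us the sum of the δ-numbers for the partial associations of A
and B in C and γ in terms of the δ-numbers for the total associations
between A and B, A and C, and B and C.

Now suppose that A and B are independent in C and γ. Then we
have:

$$\delta_{AB.C} = \delta_{AB.\gamma} = 0$$

and

$$\delta_{AB} = \frac{N}{(C)(\gamma)}\,\delta_{AC}\,\delta_{BC}$$

so that $\delta_{AB}$ is not zero unless one or both of $\delta_{AC}$,
$\delta_{BC}$ are zero.

Hence, if A and B are independent within the populations of C's and
not-C's, they will nevertheless be associated in the population at large
unless C is independent of A or B or both.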
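
Relation (2.13) is an algebraic identity, so it can be checked on any consistent set of class-frequencies. The sketch below uses exact fractions; the function name `delta_sum_check` and the frequencies passed to it are purely hypothetical values chosen for illustration, not data from the text.

```python
from fractions import Fraction as F

def delta_sum_check(N, A, B, C, AB, AC, BC, ABC):
    """Return the left and right sides of identity (2.13) as exact fractions."""
    gamma = N - C
    A_gamma, B_gamma, AB_gamma = A - AC, B - BC, AB - ABC
    d_AB_C = ABC - F(AC * BC, C)                          # delta_{AB.C}
    d_AB_gamma = AB_gamma - F(A_gamma * B_gamma, gamma)   # delta_{AB.gamma}
    d_AB = AB - F(A * B, N)
    d_AC = AC - F(A * C, N)
    d_BC = BC - F(B * C, N)
    lhs = d_AB_C + d_AB_gamma
    rhs = d_AB - F(N, C * gamma) * d_AC * d_BC
    return lhs, rhs

# Hypothetical positive class-frequencies (all ultimate classes non-negative).
lhs, rhs = delta_sum_check(N=1000, A=400, B=300, C=500,
                           AB=150, AC=250, BC=200, ABC=110)
print(lhs, rhs, lhs == rhs)
```

Working in fractions rather than floating point means the two sides agree exactly, so the check is not affected by rounding.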

Illusory associations 

2.27 This peculiar result indicates that, although a set of attributes 
independent of A and B will not affect the association between them, the 
existence of an attribute C with which they are both associated may give 
an association in the population at large which is illusory in the sense that 
it does not correspond to any real relationship between them. If the 
associations between A and C, B and C are of the same sign, the resulting 
association between A and B will be positive ; if of opposite signs, 
negative. 

The cases which we discussed at the beginning of this chapter are
instances in point. In the first illustration we saw that it was possible to
argue that the positive associations between vaccination and hygienic
conditions, exemption from attack and hygienic conditions, led to an illusory
association between vaccination and exemption from attack. Similarly, the
question was raised whether the positive association between grandfather
and grandchild may not be due to the positive associations between
grandfather and father, and father and child.

2.28 Misleading associations may easily arise through the mingling 
of records which a careful worker would keep distinct. 

Take the following case, for example. Suppose there have been 200
patients in a hospital, 100 males and 100 females, suffering from some
disease. Suppose, further, that the death-rate for males (the case
mortality) has been 30 per cent, for females 60 per cent. A new treatment is
tried on 80 per cent of the males and 40 per cent of the females, and the
results published without distinction of sex. The three attributes, with
the relations of which we are here concerned, are death, treatment and male
sex. The data show that more males were treated than females, and more
females died than males; therefore the first attribute is associated
negatively, the second positively, with the third. It follows that there will be
an illusory negative association between the first two, death and treatment.
If the treatment were completely inefficient we should, in fact, have the
following results:

                                   Males    Females    Total
    Treated and died                 24        24        48
    Treated and did not die          56        16        72
    Not treated and died              6        36        42
    Not treated and did not die      14        24        38

i.e. of the treated, only 48/120 = 40 per cent died, while of those not
treated 42/80 = 52.5 per cent died. If this result were stated without any
reference to the fact of the mixture of the sexes, to the different proportions
of the two that were treated and to the different death-rates under normal
treatment, then some value in the new treatment would appear to be
suggested. To make a fair return, either the results for the two sexes
should be stated separately, or the same proportion of the two sexes must
receive the experimental treatment. Further, care would have to be taken
in such a case to see that there was no selection (perhaps unconscious) of
the less severe cases for treatment, thus introducing another source of
fallacy (death positively associated with severity, treatment negatively
associated with severity, giving rise to illusory negative association between
treatment and death).
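
The hospital illustration is easily rebuilt from its stated percentages. The following sketch (all names are our own) constructs the table on the supposition that the treatment has no effect whatever, and then compares the pooled death-rates of treated and untreated patients.

```python
# Figures of 2.28: 100 patients of each sex; percentages as stated in the text.
patients = {"male": 100, "female": 100}
death_pct = {"male": 30, "female": 60}      # case mortality, per cent
treated_pct = {"male": 80, "female": 40}    # share given the treatment, per cent

table = {}  # (sex, treated, died) -> number of patients
for sex, n in patients.items():
    for treated in (True, False):
        share = treated_pct[sex] if treated else 100 - treated_pct[sex]
        # Treatment is assumed useless: same death-rate in both groups.
        table[sex, treated, True] = n * share * death_pct[sex] // 10_000
        table[sex, treated, False] = n * share * (100 - death_pct[sex]) // 10_000

def pooled_death_rate(treated):
    """Death-rate among treated (or untreated) patients, sexes mixed."""
    died = sum(v for (s, t, d), v in table.items() if t == treated and d)
    total = sum(v for (s, t, d), v in table.items() if t == treated)
    return died / total

print(pooled_death_rate(True), pooled_death_rate(False))
```

Within each sex the death-rate is identical for treated and untreated patients, yet the pooled rates differ (40 per cent against 52.5 per cent), which is the illusory association described above.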


2.29 Illusory associations may also arise in a different way through 
the personality of the observer or observers. If the observer's attention 
fluctuates, he may be more likely to notice the presence of A when he 
notices the presence of B, and vice versa ; in such a case A and B (so far as 
the record goes) will both be associated with the observer’s attention C, 
and consequently an illusory association will be created. Again, if the 
attributes are not well defined, one observer may be more generous than 
another in deciding when to record the presence of A and also the presence
of B, and even one observer may fluctuate in the generosity of his marking. 
In this case the recording of A and the recording of B will both be associated 
with the generosity of the observer in recording their presence, C, and an 
illusory association between A and B will consequently arise, as before. 

Determination of sign of association when the data are incomplete

2.30 It is important to notice that, though we cannot actually determine
the partial associations unless the third-order frequency (ABC) is given,
we can make some conjecture as to their signs from the values of the
second-order frequencies.

In 2.26 we have:

$$\delta_{AB.C} + \delta_{AB.\gamma} = (AB) - \frac{(AC)(BC)}{(C)} - \frac{(A\gamma)(B\gamma)}{(\gamma)} \tag{2.14}$$

Hence, if the expression on the right is positive, one at least of
$\delta_{AB.C}$, $\delta_{AB.\gamma}$ is positive, i.e. A and B are positively
associated either in C or γ or both. Similarly, if the expression is negative,
A and B are negatively associated either in C or in γ or in both. Finally, if
the expression is zero, A and B are either independent in both C and γ, or
positively associated in one and negatively in the other.

The expression may be thrown into a form more convenient when
percentages are given. Dividing through by (B) we have:

$$\frac{\delta_{AB.C} + \delta_{AB.\gamma}}{(B)} = \frac{(AB)}{(B)} - \frac{(AC)}{(C)}\cdot\frac{(BC)}{(B)} - \frac{(A\gamma)}{(\gamma)}\cdot\frac{(B\gamma)}{(B)} \tag{2.15}$$


The following example illustrates the method. 

Example 2.10 (Figures compiled from the Registrar-General's Decennial
Supplement, 1931, Part IIa, 1938). The following are the mean annual
death-rates for occupied (including retired) males of 16 years of age and
over for England and Wales during the three years 1930-1932.

                                          Death rate per thousand
    Occupied and retired males over 16            14.63
    Farmers over 16                               19.68
    Anglican clergy over 16                       27.81
    Coal hewers and getters over 16               14.69

At first sight it appears that coal hewing is about the average in healthiness
(as measured by death rate) and that farmers and clergy are decidedly
unhealthy. These conclusions are quite wrong.

The following are the proportions of the occupations 65 years old or
more at the census date 1931:

                                      Proportion per thousand
                                      65 years of age or more
    Occupied and retired males                 86.8
    Farmers                                   172.1
    Anglican clergy                           279.4
    Coal hewers and getters                    68.6

For the whole class of occupied and retired males the death rates for the
groups 16-65 years and 65 years and over were 7.93 per thousand and
85.10 per thousand.

If A denote death, B the given occupation, C old age, we have to apply
the principles of equation (2.15), calculate what would be the death-rate
for each occupation on the supposition that the rates for occupied and
retired males in general (7.93 and 85.10) apply to each of the separate
age-groups (16-65, 65 and over), and see whether the total death-rate
so calculated exceeds or falls short of the actual death-rate. If it exceeds
the actual rate the occupation must on the whole be healthy; in the
contrary case, unhealthy. Thus we have the following calculated death
rates:

    Farmers                    7.93 × 0.8279 + 85.10 × 0.1721 = 21.20
    Anglican clergy            7.93 × 0.7206 + 85.10 × 0.2794 = 29.48
    Coal hewers and getters    7.93 × 0.9314 + 85.10 × 0.0686 = 13.21

The calculated rate for farmers and clergy largely exceeds the actual
rate; these occupations then must, on the whole, be healthy. On the
other hand the rate for coal hewers and getters falls short of the actual
rate and this occupation is relatively unhealthy. The true facts are
masked in the death-rates for the occupations taken irrespective of age by
the various proportions of young and old engaged in the occupations.

It is evident that age-distributions vary so largely from one occupation
to another that total death-rates are liable to be very misleading. Similar
fallacies are liable to occur in comparisons of local death-rates, owing
to variations not only in the relative proportions of the old, but also in
the relative proportions of the two sexes.

It is hardly necessary to observe that as age is a variable quantity, the 
above procedure for calculating the comparative death-rates is extremely 
rough. The death-rate of those engaged in any occupation depends not 
only on the mere proportions over and under 65, but on the relative 
numbers at every single year of age. The simpler procedure brings out, 
however, better than a more complex one, the nature of the fallacy involved 
in assuming that crude death-rates are measures of healthiness. 
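
The calculation of Example 2.10 may be sketched as follows. The names are our own, and the script simply applies the two general age-specific rates to the age-proportions of each occupation; the last decimal place of the calculated rates may differ slightly from the printed figures, presumably because of rounding in the published rates.

```python
# Death-rates per 1,000 for all occupied and retired males (Example 2.10).
RATE_UNDER_65, RATE_65_PLUS = 7.93, 85.10

occupations = {
    # name: (actual death-rate per 1,000, proportion per 1,000 aged 65 or more)
    "Farmers": (19.68, 172.1),
    "Anglican clergy": (27.81, 279.4),
    "Coal hewers and getters": (14.69, 68.6),
}

def expected_rate(old_per_thousand):
    """Death-rate the occupation would show if it were of average healthiness."""
    old = old_per_thousand / 1000
    return RATE_UNDER_65 * (1 - old) + RATE_65_PLUS * old

for name, (actual, old) in occupations.items():
    expected = expected_rate(old)
    verdict = "healthy" if expected > actual else "unhealthy"
    print(f"{name}: calculated {expected:.2f}, actual {actual:.2f}, "
          f"relatively {verdict}")
```

Applying the same function to the general proportion of 86.8 per thousand returns very nearly the general death-rate of 14.63, which is a useful check on the figures.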

Complete independence 

2.31 The particular case in which all the $2^n - (n+1)$ given associations
are zero is worth some special investigation.

It follows, in the first place, that all other possible associations must be
zero, i.e. that a state of complete independence, as we may term it, exists.
Suppose, for instance, that we are given:

$$(AB) = \frac{(A)(B)}{N}, \quad (AC) = \frac{(A)(C)}{N}, \quad (BC) = \frac{(B)(C)}{N}, \quad (ABC) = \frac{(AC)(BC)}{(C)} = \frac{(A)(B)(C)}{N^2}$$

Then it follows at once that we have also:

$$(ABC) = \frac{(AB)(BC)}{(B)} = \frac{(AB)(AC)}{(A)}$$

i.e. A and C are independent in the population of B's, and B and C in the
population of A's. Again,

$$(AB\gamma) = (AB) - (ABC) = \frac{(A)(B)}{N} - \frac{(A)(B)(C)}{N^2} = \frac{(A)(B)(\gamma)}{N^2} = \frac{(A\gamma)(B\gamma)}{(\gamma)}$$

Therefore A and B are independent in the population of γ's. Similarly, it
may be shown that A and C are independent in the population of β's, B and
C in the population of α's.

In the next place it is evident from the above that relations of the
general form (to write the equation symmetrically)

$$\frac{(ABC)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N} \tag{2.16}$$

must hold for every class-frequency. This relation is the general form of
the equation of independence (2.2) (4).

2.32 It must be noted, however, that (2.16) is not a criterion for the
complete independence of A, B and C in the sense that the equation

$$\frac{(AB)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}$$

is a criterion for the complete independence of A and B. If we are given
N, (A) and (B), and the last relation quoted holds good, we know that
similar relations must hold for (Aβ), (αB) and (αβ). If N, (A), (B) and
(C) be given, however, and the equation (2.16) holds good, we can draw no
conclusion without further information; the data are insufficient. There
are eight algebraically independent class-frequencies in the case of three
attributes, while N, (A), (B), (C) are only four: the equation (2.16) must
therefore be shown to hold good for four frequencies of the third order
before the conclusion can be drawn that it holds good for the remainder, i.e.
that a state of complete independence subsists. The direct verification of
this result is left for the student.

Quite generally, if N, (A), (B), (C), ... be given, the relation

$$\frac{(ABC\ldots)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N}\cdots \tag{2.17}$$

must be shown to hold good for $2^n - (n+1)$ of the nth order classes before it
may be assumed to hold good for the remainder. It is only because

$$2^n - (n+1) = 1$$

when n = 2 that the relation

$$\frac{(AB)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}$$

may be treated as a criterion for the independence of A and B. If all the
n (n > 2) attributes are completely independent, the relation (2.17) holds
good; but it does not follow that if the relation (2.17) holds good they are
all independent.
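
A small numerical illustration may make the last point concrete. The eight ultimate class-frequencies below are hypothetical, chosen by us so that relation (2.16) holds for the class (ABC) although A and B are plainly associated; the helper `freq` is likewise our own.

```python
from fractions import Fraction as F

# Hypothetical ultimate class-frequencies; each key gives the presence (1)
# or absence (0) of A, B and C, in that order.
ultimate = {
    (1, 1, 1): 1, (1, 1, 0): 2, (1, 0, 1): 0, (1, 0, 0): 1,
    (0, 1, 1): 1, (0, 1, 0): 0, (0, 0, 1): 2, (0, 0, 0): 1,
}

def freq(a=None, b=None, c=None):
    """Class-frequency with the given attributes fixed; None means 'either'."""
    want = (a, b, c)
    return sum(v for key, v in ultimate.items()
               if all(w is None or w == k for w, k in zip(want, key)))

N = freq()
A, B, C = freq(a=1), freq(b=1), freq(c=1)
ABC, AB = freq(1, 1, 1), freq(1, 1)

# Relation (2.16) for the single class (ABC).
holds_2_16 = F(ABC, N) == F(A, N) * F(B, N) * F(C, N)
# Ordinary criterion of independence for A and B.
a_b_independent = F(AB, N) == F(A, N) * F(B, N)
print(holds_2_16, a_b_independent)
```

Here N = 8 and (A) = (B) = (C) = 4, so that (ABC) = 1 satisfies (2.16); but (AB) = 3, whereas independence of A and B would require (AB) = 2.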


SUMMARY 

1. Two attributes are independent if the proportion of A’s among the 
B's is the same as the proportion among the not-B’s. 

2. This definition can be expressed symbolically in numerous forms, in
terms of either first-order or second-order frequencies. The form in which
the data are given, and the question which is to be answered, determine
which form is to be employed in any particular case.

3. Attributes which are not independent are said to be positively
associated if

$$(AB) > \frac{(A)(B)}{N}$$

and negatively associated if

$$(AB) < \frac{(A)(B)}{N}$$

4. The statistical meaning of the word " association ” is different from 
the meaning ascribed to it in ordinary language. 

5. Before association may be said to indicate a definite relation between 
the attributes, it is necessary to be satisfied that the divergence from 
independence is not due to fluctuations of sampling. 

6. The divergence of the actual frequency from the "independence"
frequency is denoted by the symbol δ, and hence

$$\delta = (AB) - \frac{(A)(B)}{N}$$

7. The coefficient of association is defined by

$$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

It is zero if the attributes are independent, +1 if they are completely
associated and -1 if they are completely disassociated. There are,
however, other forms of coefficient more advantageous in certain cases.

8. The association of A and B in sub-populations of the type C, γ, CD,
CDE, etc. is called a partial association.

9. If

$$(ABC) > \frac{(AC)(BC)}{(C)}$$

A and B are positively associated in C; and if

$$(ABC) < \frac{(AC)(BC)}{(C)}$$

A and B are negatively associated in C.

10. There are $\frac{n(n-1)}{2}\,3^{n-2}$ associations in a population
characterised by n attributes, $\frac{n(n-1)}{2}$ of which are total and the
remainder partial.

11. All the associations are expressible in terms of N, (A), (B), (C),
etc., and $2^n - (n+1)$ algebraically independent associations. These relations
have, however, only a theoretical value.

12. If A and B are independent within the populations of C's and of γ's
they will nevertheless be associated within the population at large, unless
C is independent of either A or B or both.

13. In interpreting an association between A and B it must be
remembered that this may arise owing to associations of A with C and B with
C. To resolve this point it is necessary to consider the partial associations
of A and B in C and γ.

14. Complete independence of n attributes occurs if $2^n - (n+1)$
algebraically independent associations and hence all associations are zero.
In this case

$$\frac{(ABC\ldots)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N}\cdots$$

but this last condition is not sufficient for complete independence.


EXERCISES 


2.1 At the census of England and Wales in 1901 there were (to the nearest
1,000) 15,729,000 males and 16,799,000 females; 3,497 males were returned
as deaf-mutes from childhood, and 3,072 females.

State proportions exhibiting the association between deaf-mutism from 
childhood and sex. How many of each sex for the same total number 
would have been deaf-mutes if there had been no association ? 

2.2 Show, as briefly as possible, whether A and B are independent, 
positively associated or negatively associated in each of the following 
cases — 

(a) N = 5,000     (A) = 2,350    (B) = 3,100    (AB) = 1,600

(b) (A) = 490     (AB) = 294     (α) = 570      (αB) = 380

(c) (AB) = 256    (αB) = 768     (Aβ) = 48      (αβ) = 144

2.3 (Figures derived from Darwin's Cross- and Self-fertilisation of
Plants.) The table below gives the numbers of plants of certain species
that were above or below the average height, stating separately those
that were derived from cross-fertilised and from self-fertilised parentage.
Investigate the association between height and cross-fertilisation of
parentage, and draw attention to any special points you notice.

                        Parentage cross-fertilised.   Parentage self-fertilised.
                                 Height                        Height
    Species              Above        Below            Above        Below
                         average      average          average      average
    Ipomaea purpurea       63           10               18           55
    Petunia violacea       61           16               13           64
    Reseda lutea           25            7               11           21
    Reseda odorata         39           16               25           [?]
    Lobelia fulgens        17           17               12           22

2.4 (Figures from same source as Example 2.6; classes 7 and 8 of the
memoir treated as "dark.") Investigate the association between darkness
of eye-colour in father and son from the following data:

    Fathers with dark eyes and sons with dark eyes (AB)              50
    Fathers with dark eyes and sons with not-dark eyes (Aβ)          79
    Fathers with not-dark eyes and sons with dark eyes (αB)          89
    Fathers with not-dark eyes and sons with not-dark eyes (αβ)     782

Also tabulate for comparison the frequencies that would have been
observed had there been no heredity, i.e. the values of (AB)_0, (Aβ)_0, etc.

2.5 (Figures from same source as above.) Investigate the association
between eye-colour of husband and eye-colour of wife ("assortative
mating") from the data given below.

    Husbands with light eyes and wives with light eyes (AB)             309
    Husbands with light eyes and wives with not-light eyes (Aβ)         214
    Husbands with not-light eyes and wives with light eyes (αB)         132
    Husbands with not-light eyes and wives with not-light eyes (αβ)     119

Also tabulate for comparison the frequencies that would have been
observed had there been strict independence between eye-colour of husband
and eye-colour of wife, i.e., the values of (AB)_0, etc., as in Exercise 2.4.

2.6 (Figures from the Census of England and Wales, 1891, vol. 3: the
data cannot be regarded as trustworthy.) The figures given below show
the number of males in successive age-groups, together with the number
of the blind (A), of the mentally deranged (B) and the blind mentally
deranged (AB). Trace the association between blindness and mental
derangement from childhood to old age, tabulating the proportions of
insane amongst the whole population and amongst the blind, and also
the association coefficient Q of 2.15. Give a short verbal statement of
your results.

    Age-group               N          (A)       (B)      (AB)
    5-                  3,304,230      S44      2,820      17
    15-                 2,712,521     1,184     6,225      19
    25-                 2,089,010     1,165     8,482      19
    35-                 1,611,077     1,501     9,214      31
    45-                 1,191,789     1,752     8,187      32
    55-                   770,124     1,905     5,799      34
    65-                   444,896     1,932     3,412      22
    75 and upwards        161,692     1,701     1,098       9


2.7 Show that if

$$(AB)_1,\ (\alpha B)_1,\ (A\beta)_1,\ (\alpha\beta)_1$$

$$(AB)_2,\ (\alpha B)_2,\ (A\beta)_2,\ (\alpha\beta)_2$$

be two aggregates corresponding to the same values of (A), (B), (α) and (β),

$$(AB)_1 - (AB)_2 = (\alpha B)_2 - (\alpha B)_1 = (A\beta)_2 - (A\beta)_1 = (\alpha\beta)_1 - (\alpha\beta)_2$$

2.8 Show that if

$$\delta = (AB) - (AB)_0$$

$$(AB)^2 + (\alpha\beta)^2 - (\alpha B)^2 - (A\beta)^2 = [(A) - (\alpha)][(B) - (\beta)] + 2N\delta$$

2.9 The existence of association may be tested either by comparison of
proportions (e.g. (AB)/(B) with (Aβ)/(β)), as in 2.10 and 2.11, or by the
value of δ as in 2.12 and 2.13. Show that

$$\delta = \frac{(B)(\beta)}{N}\left\{ \frac{(AB)}{(B)} - \frac{(A\beta)}{(\beta)} \right\}
= \frac{(A)(\alpha)}{N}\left\{ \frac{(AB)}{(A)} - \frac{(\alpha B)}{(\alpha)} \right\}$$

2.10 Spence and Charles, in An Investigation into the Health and Nutrition
of Certain of the Children of Newcastle-on-Tyne between the Ages of One
and Five Years (City and Council of Newcastle-on-Tyne, February 1934),
compared two groups of children, one belonging to the professional classes,
125 in number, and the other belonging to the labouring classes, 124 in
number. They found the following results:

                              Poor children    Well-to-do children
                                Per cent            Per cent
    Below normal weight            55                  13
    Above normal weight            11                  48

Find the coefficient of association between the weight of the children and
their social status.

2.11 (Data from the Report on the Spahlinger Experiments in Northern
Ireland, 1931-1934, H.M. Stationery Office, 1935.) In experiments on
the immunisation of cattle from tuberculosis the following results were
secured:

                                               Cattle
                                 Died of tuberculosis   Unaffected or
    Treatment                    or very seriously      only slightly    Total
                                 affected               affected
    Inoculated with vaccine              6                   13            19
    Not inoculated or inoculated
      with control media                 8                    3            11
    Total                               14                   16            30

(The cattle were first inoculated with protective vaccine and then
deliberately infected with serious quantities of tubercle germs.)

Find the coefficient of association between inoculation and exemption
from serious tuberculosis.

2.12 Criticise the following argument: "Nearly all the A's are B's, and
therefore A and B must be associated," and state what suppressed premises
would justify it in the following cases:

"99 per cent of the people who drink beer die before reaching 100 years
of age. Therefore drinking beer is bad for longevity."

"99 per cent of the members who voted for the Army Estimates were
military officers. Therefore it was unfair to suppose that the voting was
unbiased."

"In every country where the sale of contraceptives is tolerated by the
Government the birth-rate is declining. Therefore contraception must
exert an influence on the birth-rate."

2.13 Write down in the form of the table of 2.1 the frequency groups
when (1) all A's are B's; (2) all B's are A's; (3) all A's are B's and all
B's are A's; and the three similar tables when A and B are completely
disassociated.

2.14 Take the following figures for girls corresponding to those for boys
in Example 2.9, and discuss them similarly, but not necessarily
using exactly the same comparisons, to see whether the conclusion that
"the connecting link between defects of body and mental dullness is the
coincident defect of brain which may be known by observation of abnormal
nerve signs" seems to hold good.

A, development defects; B, nerve signs; D, mental dullness.

    N       10,000        (AB)      248
    (A)        682        (AD)      307
    (B)        850        (BD)      363
    (D)        689        (ABD)     128

2.15 (Material from Census of England and Wales, 1891, vol. 3.) The 
following figures give the numbers of those suffering from single or com- 
bined infirmities : (1) for all males ; (2) for males of 55 years of age and 
over. 

A, blindness ; B, mental derangement ; C, deaf-mutism. 



              (1)             (2)                          (1)             (2)
           All Males    Males 55 and over               All Males    Males 55 and over

N         14,053,000       1,377,000          (AB)         183              65
(A)           12,281           5,538          (AC)          51              14
(B)           45,392          10,309          (BC)         299              47
(C)            7,707             746          (ABC)         11               3


Tabulate proportions per thousand, exhibiting the total association 
between blindness and mental derangement, and the partial association 
between the same two infirmities among deaf-mutes : (1) for males in 
general ; (2) for those of 55 years of age and over. Give a short verbal 
statement of the results. 
2.16 (Material from same source as in Example 2.10.)

The death-rate from cancer for occupied and retired males in general
(over 16) is 2.004 per thousand per annum, and for farmers 2.633.

The death-rates from cancer for occupied males under and over 45
are 0.184 and 4.960 respectively. Of the farmers, 53.22
per cent are over 45.

Would you say that farmers were peculiarly liable to cancer ? 

2.17 A population of males over 15 years of age consists of 7 per cent
over 65 years of age and 93 per cent under. The death-rates are 12 per
thousand per annum in the younger class and 110 in the older, or 18.86
in the whole population. The death-rate of males (over 15) engaged in
a certain industry is 26.7 per thousand.

If the industry be not unhealthy, what must be the approximate propor- 
tion of those over 65 engaged in it (neglecting minor differences of age 
distribution) ? 

2.18 Show that if A and B are independent, while A and C, B and C are
associated, A and B must be disassociated either in the population of C's,
the population of γ's, or both.

2.19 As an illustration of Exercise 2.18, show that if the following were
actual data, there would be a slight disassociation between the eye-colours 
of husband and wife (father and mother) for the parents either of light- 
eyed sons or not-light-eyed sons, or both, although there is a slight positive 
association for parents at large. 

A light eye-colour in husband, B in wife, C in son — 


N      1,000        (AB)    358
(A)      622        (AC)    471
(B)      558        (BC)    419
(C)      617




2.20 Show that if (ABC) = (αβγ), (αBC) = (Aβγ), and so on (the case of
"complete equality of contrary frequencies" of Exercise 1.6, page 15),
A, B and C are completely independent if A and B, A and C, B and C
are independent pair and pair.

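2.21 If, in the same case of complete equality of contraries,

$$ (AB) - N/4 = \delta_1, \qquad (AC) - N/4 = \delta_2, \qquad (BC) - N/4 = \delta_3, $$

show that

$$ 2\left[(ABC) - \frac{(AC)(BC)}{(C)}\right] = 2\left[(AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}\right] = \delta_1 - \frac{4\delta_2\delta_3}{N} $$

so that the partial associations between A and B in the populations C and
γ are positive or negative according as

$$ \delta_1 \gtrless \frac{4\delta_2\delta_3}{N} $$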

2.22 In the straight contests of a general election (contests in which one 
Conservative opposed one Socialist and there were no other candidates) 
66 per cent of the winning candidates (according to the returns) spent 
more money than their opponents. Given that 63 per cent of the winners 
were Conservatives, and that the Conservative expenditure exceeded the 
Socialist in 80 per cent of the contests, find the percentages of elections 
won by Conservatives (1) when they spent more and (2) when they spent 
less than their opponents, and hence say whether you consider the above 
figures evidence of the influence of expenditure on election results or no. 
(Note that if the one candidate in a contest be a Conservative-winner-who 
spends more than his opponent, the other must necessarily be a Socialist- 
loser-who spends less — and so forth. Hence the case is one of complete 
equality of contraries.) 

2.23 Given that (A)/N = (B)/N = (C)/N = x, and that (AB)/N = (AC)/N
= y, find the major and minor limits to y that enable one to infer positive
association between B and C, i.e. (BC)/N > x^2.

Draw a diagram on squared paper to illustrate your answer, taking x 
and y as co-ordinates, and shading the limits within which y must lie in 
order to permit of the above inference. Point out the peculiarities in the 
case of inferring a positive association from two negative associations. 

2.24 Discuss similarly the more complex case (A)/N = x, (B)/N = 2x,
(C)/N = 3x:

(1) for inferring positive association between B and C given
(AB)/N = (AC)/N = y.

(2) for inferring positive association between A and C given
(AB)/N = (BC)/N = y.

(3) for inferring positive association between A and B given
(AC)/N = (BC)/N = y.

2.25 Draw a graph of the curve y = 2x/(1 + x^2) for the range -1 < x < 1
and hence discuss the relationship between the coefficient of association Q
and the coefficient of colligation Y. Hence show, graphically or otherwise,
that the maximum difference between the two occurs when Q is ±0.786
approximately.
Manifold classification

3.1 Instead of dividing the population under consideration into two parts
by a simple dichotomy, we may also divide it into a number of parts by
a similar process. For instance, we can extend the dichotomy of the
population of men into "those with blue eyes" and "those not with blue
eyes" to a threefold division: "those with blue eyes," "those with
brown eyes," and "those with neither blue nor brown eyes"; or into a
fourfold division by adding a fresh category, "those with grey eyes";
and so on.

Generally, our population may be divided first according to s heads,
A_1, A_2, . . . A_s; each of the classes so obtained into t heads, B_1, . . .
B_t; each of these into u heads, C_1, . . . C_u; and so on.

This is called manifold classification. 

3.2 The general theory of manifold classification for n attributes is 
rather complicated, but its fundamental principles are very similar to 
those which apply to dichotomy. A straightforward extension of the 
methods of Chapter 1 will give the following results, which we are content 
to announce without a formal proof — 

(a) There are s × t × u × . . . ultimate classes.

(b) The total number of classes, including N and the ultimate classes,
is (s+1)(t+1)(u+1) . . .

(c) The data are consistent if, and only if, every ultimate class-frequency
is not negative.

(d) The data are completely specified by s × t × u × . . . algebraically
independent class-frequencies. Even if all these are not given, it may be
possible to set limits to the other class-frequencies.

For example, if the population of the United Kingdom is classified 
geographically according to habitation in England, Wales, Scotland and 
Northern Ireland ; by eye-colour into blue, brown, grey, green and the 
remainder ; and by hair-colour into black, fair, red and the remainder ; 
there will be 150 classes altogether, expressible in terms of 80 independent 
class-frequencies. 
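
The two counts quoted in this example follow directly from results (a) and (b). The short Python sketch below evaluates them; the function name `class_counts` is our own illustrative choice and not part of the text's notation.

```python
from math import prod

def class_counts(heads):
    """Return (ultimate classes, total classes including N) for a manifold
    classification in which the k-th attribute is divided into heads[k] heads."""
    ultimate = prod(heads)                 # s x t x u x ...
    total = prod(k + 1 for k in heads)     # (s+1)(t+1)(u+1)...
    return ultimate, total

# Habitation (4 heads), eye-colour (5 heads), hair-colour (4 heads).
ultimate, total = class_counts([4, 5, 4])
print(ultimate, total)  # 80 150
```

For two simple dichotomies the same function gives 4 ultimate classes and 9 classes in all, in agreement with Chapter 1.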

3.3 Data so completely specified are very rare, and an elaborate discussion
of the general case would hardly be justified by its practical value. For
the remainder of this chapter, therefore, we shall be concerned solely
with the case of two characteristics, A and B.

Contingency tables 

3.4 Let us suppose that the classification of the A's is s-fold and that
of the B's is t-fold. Then there will be st classes of the type A_m B_n.

Generalising slightly the notation of previous chapters, let the frequency
of individuals A_m be denoted by (A_m) and of individuals A_m B_n by (A_m B_n).
The data can then be set out in the form of a table of t rows and s columns
as follows:

TABLE 3.1 


Attribute   A_1         A_2         . . .   A_s         Totals

B_1         (A_1 B_1)   (A_2 B_1)   . . .   (A_s B_1)   (B_1)
B_2         (A_1 B_2)   (A_2 B_2)   . . .   (A_s B_2)   (B_2)
. . .
B_t         (A_1 B_t)   (A_2 B_t)   . . .   (A_s B_t)   (B_t)

Totals      (A_1)       (A_2)       . . .   (A_s)       N


In this table the frequency of the class A_m B_n is entered in the
compartment common to the mth column and the nth row; the totals at the
ends of rows and at the feet of columns give the first order frequencies,
i.e. the numbers of A_m's and B_n's; and finally, the grand total in the
bottom right-hand corner gives the whole number of observations.

Such a table is called a contingency table. It is a generalised form
of the fourfold (2 x 2-fold) table in 2.1.

Example 3.1: In Table 3.2 below the classification is 3 x 4-fold:
the eye-colours are classed under the three heads "blue," "grey or
green" and "brown," while the hair-colours are classed under four
heads, "fair," "brown," "black" and "red."


TABLE 3.2: Hair- and eye-colours of 6800 males in Baden

(Ammon, Zur Anthropologie der Badener)

                            Hair-colour
Eye-colour        Fair    Brown    Black    Red     Total

Blue              1768     807      189      47      2811
Grey or Green      946    1387      746      53      3132
Brown              115     438      288      16       857

Total             2829    2632     1223     116      6800

Taking the first row, the table tells us that there were 2811 men with
blue eyes noted, of whom 1768 had fair hair, 807 brown hair, 189 black
hair and 47 red hair. Similarly, from the first column, there were 2829
men with fair hair, of whom 1768 had blue eyes, 946 grey or green eyes
and 115 brown eyes.


Association in contingency tables 

3.5 For the purpose of discussing the nature of the relation between
the A's and the B's, any such table may be treated on the principles of
the preceding chapter by reducing it in different ways to a 2 x 2-fold form.
It then becomes possible to trace the association between any one or more
of the A's and any one or more of the B's, either in the population at large
or in populations limited by the omission of one or more of the A's, of the
B's, or of both.

If, for example, we desire to trace the association between a lack of 
pigmentation in eyes and in hair, rows 1 and 2 may be pooled together as 
representing the least pigmentation of the eyes, and columns 2, 3 and 4 
may be pooled together as representing hair with a more or less marked 
degree of pigmentation. We then have — 


Proportion of light-eyed with fair hair    2714/5943 = 46 per cent
Proportion of brown-eyed with fair hair     115/857  = 13 per cent


The association is therefore well marked. For comparison we may trace
the corresponding association between the most marked degree of
pigmentation in eyes and hair, i.e. brown eyes and black hair. Here we must add
together rows 1 and 2 as before, and pool columns 1, 2 and 4 (the column
for red being really misplaced, as red represents a comparatively slight
degree of pigmentation). The figures are:

Proportion of brown-eyed with black hair    288/857  = 34 per cent
Proportion of light-eyed with black hair    935/5943 = 16 per cent


The association is again positive and well marked, but the difference 
between the two percentages is rather less than in the last case. 
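
The pooling of rows and columns used in these two comparisons is easily mechanised. The following Python sketch recomputes the four percentages from Table 3.2; the function `pooled_proportion` and the variable names are illustrative assumptions of ours.

```python
# Table 3.2: rows are eye-colours (blue, grey or green, brown),
# columns are hair-colours (fair, brown, black, red).
TABLE_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

def pooled_proportion(table, rows, cols):
    """Of the individuals in the pooled rows, the proportion that fall
    in the pooled columns."""
    inside = sum(table[r][c] for r in rows for c in cols)
    row_total = sum(sum(table[r]) for r in rows)
    return inside / row_total

light_eyes, brown_eyes = [0, 1], [2]   # rows 1 and 2 pooled; row 3
fair, black = [0], [2]                 # hair columns of interest

print(round(100 * pooled_proportion(TABLE_3_2, light_eyes, fair)))   # 46
print(round(100 * pooled_proportion(TABLE_3_2, brown_eyes, fair)))   # 13
print(round(100 * pooled_proportion(TABLE_3_2, brown_eyes, black)))  # 34
print(round(100 * pooled_proportion(TABLE_3_2, light_eyes, black)))  # 16
```

Any other reduction to a 2 x 2-fold form is obtained by changing the lists of pooled rows and columns.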


3.6 The mode of treatment adopted in the preceding two paragraphs
rests on first principles and, if fully carried out, gives us all the information
possible about the associations of the two attributes. At the same time,
it is laborious if s and t are at all large. Moreover, in practical work we are
often concerned, not with the associations of individual A's with individual
B's, but with finding the answer to a general question of the type: Are the
A's on the whole distinctly dependent on the B's, and if so, is this
dependence very close, or the reverse? In fact, what we want is a coefficient
which will summarise the general nature of the dependence. We will
proceed to discuss two such coefficients.


Coefficients of contingency 

3.7 If the A's and B's be completely independent in the population at
large, we must have for all values of m and n:

$$ (A_m B_n) = \frac{(A_m)(B_n)}{N} = (A_m B_n)_0 \qquad (3.1) $$

If, however, A and B are not completely independent, (A_m B_n) and (A_m B_n)_0
will not be identical for all values of m and n. Let the difference be given
by

$$ \delta_{mn} = (A_m B_n) - (A_m B_n)_0 \qquad (3.2) $$

Let us note in passing the following properties of these quantities:

(1) In the first place, δ_mn is not, in general, equal to δ_nm.

(2) In the second place, the δ's are not all algebraically independent.
We have, in fact, for any particular m:

$$ \sum_{n=1}^{t} \delta_{mn} = (A_m B_1) - \frac{(A_m)(B_1)}{N} + (A_m B_2) - \frac{(A_m)(B_2)}{N} + \dots + (A_m B_t) - \frac{(A_m)(B_t)}{N} $$

$$ = (A_m) - \frac{(A_m)}{N}\left[(B_1) + (B_2) + \dots + (B_t)\right] = 0 \qquad (3.3) $$

A similar relation is true for any particular n.

Now there are st δ-quantities. In virtue of the relationship we have
just proved, for any particular m only (t-1) of the δ-quantities δ_mn are
independent. Similarly, for any n only (s-1) are independent. Hence
the total number of independent δ's is (s-1)(t-1).
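
Relation (3.3) may be checked numerically on the data of Table 3.2. The sketch below is a minimal illustration in Python, with exact rational arithmetic so that the sums come out as exactly zero; the function name `deltas` is an assumption of ours.

```python
from fractions import Fraction

# Table 3.2: rows are eye-colours, columns are hair-colours.
TABLE_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

def deltas(table):
    """Differences between each observed frequency and its independence
    value (row total x column total / N), as exact fractions."""
    n_total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [Fraction(table[r][c]) - Fraction(row_totals[r] * col_totals[c], n_total)
         for c in range(len(col_totals))]
        for r in range(len(row_totals))
    ]

d = deltas(TABLE_3_2)
print(all(sum(row) == 0 for row in d))         # True: each row of deltas sums to zero
print(all(sum(col) == 0 for col in zip(*d)))   # True: so does each column
print((len(d) - 1) * (len(d[0]) - 1))          # 6 independent deltas in a 3 x 4 table
```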


3.8 These δ-quantities indicate the extent of the associations, and we
expect a summarising coefficient to be built up from them in some way.
It would, however, be useless to add them together, for in virtue of the
relation of the preceding paragraph the sum is zero. We wish to construct
a coefficient which shall be independent of the signs of the δ-numbers.

We therefore define

$$ \chi^2 = \sum_{m,n} \frac{\delta_{mn}^2}{(A_m B_n)_0} \qquad (3.4) $$

and call χ² the "square contingency."

We then write:

$$ \phi^2 = \frac{\chi^2}{N} \qquad (3.5) $$

and call φ² the "mean-square contingency."

Clearly χ² and φ², being the sums of squares, cannot be negative. They
vanish if, and only if, every δ-number vanishes, in which case A and B
are independent.

Pearson’s coefficient of mean-square contingency 

3.9 The quantity φ² is not quite suitable in itself to form a coefficient,
because its limits vary in different cases. Karl Pearson therefore proposed
the coefficient C, defined by

$$ C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}} \qquad (3.6) $$

This is called the coefficient of mean-square contingency. In general, 
no sign should be attached to the root, for the coefficient merely shows 
whether two characters are or are not independent ; but in certain cases a 
conventional sign may be used. Thus, in Table 3.2 slight pigmentation 
of eyes and hair appear to go together, and the contingency may be 
regarded as positive. If slight pigmentation of eyes had been associated 
with marked pigmentation of hair, the contingency might have been 
regarded as negative. 

3.10 The coefficient C has one serious disadvantage. Although, as
may be seen from its definition, it increases with φ² towards a limit 1, it
never reaches that limit. In fact, the maximum value which it can attain
depends on s and t, and reaches unity only for an infinite number of classes.
This may be briefly illustrated as follows. Replacing δ_mn in equation
(3.4) by its value in terms of (A_m B_n) and (A_m B_n)_0, we have:

$$ \chi^2 = \sum_{m,n} \frac{(A_m B_n)^2}{(A_m B_n)_0} - N \qquad (3.7) $$

and therefore, denoting the summation by S,

$$ C = \sqrt{\frac{S - N}{S}} \qquad (3.8) $$

Now suppose we have to deal with a t x t-fold classification in which
(A_m) = (B_m) for all values of m; and suppose, further, that the association
between A_m and B_m is perfect, so that (A_m B_m) = (A_m) = (B_m) for all values
of m, the remaining frequencies of the second order being zero; all the
frequency is then concentrated in the diagonal compartments of the table,
and each contributes N to the summation S. The total value of S is
accordingly tN, and the value of C:

$$ C = \sqrt{\frac{t - 1}{t}} $$

This is the greatest possible value of C for a symmetrical t x t-fold
classification, and therefore, in such a table, for:

t =  2, C cannot exceed 0.707
t =  3, C cannot exceed 0.816
t =  4, C cannot exceed 0.866
t =  5, C cannot exceed 0.894
t =  6, C cannot exceed 0.913
t =  7, C cannot exceed 0.926
t =  8, C cannot exceed 0.935
t =  9, C cannot exceed 0.943
t = 10, C cannot exceed 0.949
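
The limiting values in this list can be reproduced in a few lines. The Python sketch below evaluates the square root of (t-1)/t for t = 2 to 10 and, as a check, computes C from equation (3.8) for a small perfectly associated table; the function names and the 3 x 3 diagonal table are illustrative assumptions of ours.

```python
from math import isclose, sqrt

def max_c(t):
    """Greatest possible C for a symmetrical t x t-fold classification."""
    return sqrt((t - 1) / t)

def contingency_c(table):
    """C from equation (3.8): S is the sum of squared frequencies divided
    by their independence values, and C = sqrt((S - N) / S)."""
    n_total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    s_sum = sum(
        table[r][c] ** 2 * n_total / (row_totals[r] * col_totals[c])
        for r in range(len(row_totals))
        for c in range(len(col_totals))
    )
    return sqrt((s_sum - n_total) / s_sum)

for t in range(2, 11):
    print(f"t = {t:2d}, C cannot exceed {max_c(t):.3f}")

# A made-up 3 x 3 table with all the frequency on the diagonal.
diagonal = [[10, 0, 0], [0, 20, 0], [0, 0, 30]]
print(isclose(contingency_c(diagonal), max_c(3)))  # True
```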

3.11 Hence, coefficients calculated from different systems of classification
are not, strictly speaking, comparable. This is clearly undesirable. Two
coefficients calculated from the same data classified in two different
groupings ought not to be very different.

It is as well, therefore, to restrict the use of the C-coefficient to 5 x 5 or
finer groupings. At the same time, the classification must not be made too
fine, or the value of the coefficient is largely affected by casual irregularities
arising from sampling fluctuations.*


Tschuprow’s coefficient 

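3.12 To remedy the defect to which we have just referred, Tschuprow
proposed the coefficient T, defined by

$$ T^2 = \frac{\phi^2}{\sqrt{(s-1)(t-1)}} \qquad (3.9) $$

This coefficient varies between 0 and 1 in the desired manner when s = t.
We have

$$ C^2 = \frac{\phi^2}{1 + \phi^2} = \frac{T^2\sqrt{(s-1)(t-1)}}{1 + T^2\sqrt{(s-1)(t-1)}} \qquad (3.10) $$

and conversely,

$$ T^2 = \frac{C^2}{(1 - C^2)\sqrt{(s-1)(t-1)}} \qquad (3.11) $$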


* Karl Pearson discussed a "correction" to be made to C calculated from coarsely
grouped data. The use of such corrections depends to some extent on assumptions
about the population, and may be regarded as attempts to bring the value of C closer
to a putative coefficient of correlation (cf. 10.20).
Calculation of C and T

3.13 The calculation of C and T is simplified by the use of equation
(3.8), which enables us to replace the calculation of the δ's by
calculations based on frequencies of types (A_m), (B_n) and (A_m B_n). All these
quantities are contained in the contingency tables. The following example
will illustrate the method:

Example 3.2 — Consider the data of Table 3.2. (The classification is 
only 3 X 4-fold and is therefore rather crude for calculating C, but it will 
serve as an illustration of the form of the arithmetic.) 

We require first of all the quantities (A_m B_n)_0, i.e. the "independence"
values. These are calculated directly from their definition

$$ (A_m B_n)_0 = \frac{(A_m)(B_n)}{N} $$

and thus the value for the compartment in the mth column and nth row
is the product of the total frequencies in that column and row divided by
the whole frequency, e.g. (A_1 B_1)_0 = 2829 x 2811/6800 = 1169, and so on.

It is convenient to tabulate the frequencies so obtained in a second
contingency table, as in Table 3.3.

TABLE 3.3: Independence values of the frequencies for Table 3.2

                            Hair-colour
Eye-colour        Fair    Brown    Black    Red

Blue              1169    1088      506     48.0
Grey or Green     1303    1212      563     53.4
Brown              357     332      154     14.6


We now calculate the quantities (A_m B_n)^2/(A_m B_n)_0:

(1768)^2/1169     2673.9
(946)^2/1303       686.8
(115)^2/357         37.0
(807)^2/1088       598.6
(1387)^2/1212     1587.3
(438)^2/332        577.8
(189)^2/506         70.6
(746)^2/563        988.5
(288)^2/154        538.6
(47)^2/48.0         46.0
(53)^2/53.4         52.6
(16)^2/14.6         17.5

Total = S =       7875.2
From equation (3.8)

$$ C = \sqrt{\frac{S - N}{S}} = \sqrt{\frac{1075.2}{7875.2}} = \sqrt{0.1365} = 0.37 $$

and

$$ T^2 = \frac{C^2}{(1 - C^2)\sqrt{(s-1)(t-1)}} = \frac{0.1365}{0.8635\sqrt{6}} = 0.0645, \qquad T = \sqrt{0.0645} = 0.25 $$

The squares in such work may conveniently be taken from Barlow's
Tables of Squares, Cubes, etc., or logarithms may be used throughout;
five-figure logarithms are quite sufficient.

It will be seen that T is less than C. This is not always true. Which- 
ever coefficient we use, however, the contingency between pigmentation 
of hair and eye is evident. 
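
The arithmetic of Example 3.2 can also be carried through without tables. The Python sketch below is one way of doing so, under the assumption that the independence values are kept unrounded; the sum S then differs slightly from the 7875.2 obtained above with rounded values, but C and T agree to two decimal places. The function name `contingency_coefficients` is ours.

```python
from math import sqrt

# Table 3.2: rows are eye-colours, columns are hair-colours.
TABLE_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

def contingency_coefficients(table):
    """Return (S, C, T) for a contingency table given as a list of rows."""
    n_total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # S: each squared frequency divided by its independence value
    # (row total x column total / N), summed over all compartments.
    s_sum = sum(
        table[r][c] ** 2 * n_total / (row_totals[r] * col_totals[c])
        for r in range(len(row_totals))
        for c in range(len(col_totals))
    )
    chi2 = s_sum - n_total                      # square contingency
    c_coef = sqrt(chi2 / s_sum)                 # equation (3.8)
    root = sqrt((len(row_totals) - 1) * (len(col_totals) - 1))
    t_coef = sqrt(chi2 / n_total / root)        # T^2 = phi^2 / sqrt((s-1)(t-1))
    return s_sum, c_coef, t_coef

s_sum, c_coef, t_coef = contingency_coefficients(TABLE_3_2)
print(round(c_coef, 2), round(t_coef, 2))  # 0.37 0.25
```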

3.14 While such coefficients of contingency are a great convenience
in many forms of work, their use should not lead to a neglect of the more
detailed treatment of 3.5. Whether the coefficients be calculated or no,
every table should always be examined with care to see if it exhibits any
apparently significant peculiarities in the distribution of frequency, e.g.
in the associations subsisting between A_m and B_n in limited populations.
A good deal of caution must be used in order not to be misled by casual 
irregularities due to paucity of observations in some compartments of 
the table, but important points that would otherwise be overlooked will 
often be revealed by such a detailed examination. 

3.15 Suppose, for example, that any four adjacent frequencies, say

    (A_m B_n)        (A_{m+1} B_n)
    (A_m B_{n+1})    (A_{m+1} B_{n+1})

are extracted from the general contingency table. If these are considered
as a table exhibiting the association between A_m and B_n in a population
limited to A_m, A_{m+1}, B_n, B_{n+1} alone, the association is positive, negative or
zero according as (A_m B_n)/(A_{m+1} B_n) is greater than, less than, or equal
to the ratio (A_m B_{n+1})/(A_{m+1} B_{n+1}). The whole of the contingency table
can be analysed into a series of elementary groups of four frequencies like
the above, each one overlapping its neighbours, so that an s x t-fold table
contains (s-1)(t-1) such "tetrads," and the associations in them all
can be very quickly determined by simply tabulating the ratios like
(A_m B_n)/(A_{m+1} B_n), (A_m B_{n+1})/(A_{m+1} B_{n+1}), etc., or perhaps better, the
proportions (A_m B_n)/{(A_m B_n) + (A_{m+1} B_n)}, etc., for every pair of columns
or of rows, as may be most convenient. Taking the figures of Table 3.2
as an illustration, and working from the rows, the proportions run as
follows:

Hair-colour      For rows 1 and 2           For rows 2 and 3

Fair             1768/2714    0.651          946/1061    0.892
Brown             807/2194    0.368         1387/1825    0.760
Black             189/935     0.202          746/1034    0.721
Red                47/100     0.470           53/69      0.768


In both cases the first three ratios form descending series, but the fourth 
ratio is greater than the second. The signs of the associations in the six 
tetrads are, accordingly, 

+ + - 

+ + - 

The negative sign in the two tetrads on the right is striking, the more so 
as other tables for hair- and eye-colour, arranged in the same way, exhibit 
just the same characteristic. But the peculiarity will be removed at once 
if the fourth column be placed immediately after the first: if this be done,
i.e. if "red" be placed between "fair" and "brown" instead of at the
end of the colour-series, the sign of the association in all the elementary
tetrads will be the same. The colours will then run fair, red, brown, 
black, and this would seem to be the more natural order, considering the 
depth of the pigmentation. 
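
The signs of the elementary tetrads may equally be found from the cross-products of the four frequencies, a tetrad being positive when the product of its diagonal pair exceeds that of the other pair. The Python sketch below applies this to Table 3.2 as it stands and again with the columns rearranged in the order fair, red, brown, black; the helper names `tetrad_signs` and `reorder_columns` are illustrative assumptions of ours.

```python
# Table 3.2: rows are eye-colours, columns are hair-colours
# in the order fair, brown, black, red.
TABLE_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

def tetrad_signs(table):
    """Sign of the association in every elementary tetrad, arranged as
    (pairs of adjacent rows) x (pairs of adjacent columns)."""
    signs = []
    for r in range(len(table) - 1):
        row_signs = []
        for c in range(len(table[0]) - 1):
            cross = (table[r][c] * table[r + 1][c + 1]
                     - table[r][c + 1] * table[r + 1][c])
            row_signs.append("+" if cross > 0 else "-" if cross < 0 else "0")
        signs.append(row_signs)
    return signs

def reorder_columns(table, order):
    """Return a copy of the table with its columns taken in the given order."""
    return [[row[c] for c in order] for row in table]

for line in tetrad_signs(TABLE_3_2):
    print(" ".join(line))          # + + -   on both lines

# Fair, red, brown, black.
for line in tetrad_signs(reorder_columns(TABLE_3_2, [0, 3, 1, 2])):
    print(" ".join(line))          # + + +   on both lines
```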

Isotropic contingency tables 

3.16 A distribution of frequency of such a kind that the association 
in every elementary tetrad is of the same sign, possesses several useful 
and interesting properties, as shown in the following theorems. It will be 
termed an isotropic distribution. 

(1) In an isotropic distribution the sign of the association is the same not
only for every elementary tetrad of adjacent frequencies, but for every set of
four frequencies in the compartments common to two rows and two columns,
e.g. (A_m B_n), (A_{m+p} B_n), (A_m B_{n+q}), (A_{m+p} B_{n+q}).

For suppose that the sign of association in the elementary tetrads is
positive, so that

$$ (A_m B_n)(A_{m+1} B_{n+1}) > (A_{m+1} B_n)(A_m B_{n+1}) $$

and similarly,

$$ (A_{m+1} B_n)(A_{m+2} B_{n+1}) > (A_{m+2} B_n)(A_{m+1} B_{n+1}) $$

Then multiplying up and cancelling, we have:

$$ (A_m B_n)(A_{m+2} B_{n+1}) > (A_{m+2} B_n)(A_m B_{n+1}) $$

That is to say, the association is still positive though the two columns
A_m and A_{m+2} are no longer adjacent.

(2) An isotropic distribution remains isotropic in whatever way it may
be condensed by grouping together adjacent rows or columns.

Thus from the first and third inequalities above we have, adding:

$$ (A_m B_n)\left[(A_{m+1} B_{n+1}) + (A_{m+2} B_{n+1})\right] > (A_m B_{n+1})\left[(A_{m+1} B_n) + (A_{m+2} B_n)\right] $$

that is to say, the sign of the elementary association is unaffected by
throwing the (m+1)th and (m+2)th columns into one.

(3) As the extreme case of the preceding theorem, we may suppose 
both rows and columns grouped and regrouped until only a 2 x 2-fold 
table is left ; we then have the theorem — 

If an isotropic distribution be reduced to a fourfold distribution in any 
way whatever by addition of adjacent rows and columns, the sign of the 
association in such fourfold table is the same as in the elementary tetrads of 
the original table. 

The case of complete independence is a special case of isotropy. For if

$$ (A_m B_n) = \frac{(A_m)(B_n)}{N} $$

for all values of m and n, the association is evidently zero for every tetrad.
Therefore the distribution remains independent in whatever way the
table be grouped, or in whatever way the population be limited by the
omission of rows or columns. The expression "complete independence"
is therefore justified.

From the work of the preceding section we may say that Table 3.2
is not isotropic as it stands, but may be regarded as a disarrangement of
an isotropic distribution. It is best to rearrange such a table in isotropic
order, as otherwise different reductions to fourfold form may lead to 
associations of different sign, though of course they need not necessarily 
do so. 

3.17 The following will serve as an illustration of a table that is not 
isotropic and cannot be rendered isotropic by any rearrangement of the 
order of rows and columns — 

TABLE 3.4: Showing the frequencies of different combinations of
eye-colours in father and son

1. Blue   2. Blue-green, grey   3. Dark grey, hazel   4. Brown
(Data of Galton, from Karl Pearson, Phil. Trans., A, 1900, 195, 138; classification condensed.)

Son's                  Father's eye-colour
eye-colour        1       2       3       4      Total

1               194      70      41      30       335
2                83     124      41      36       284
3                25      34      55      23       137
4                56      36      43     109       244

Total           358     264     180     198      1000
The following are the ratios of the frequency in column m to the sum
of the frequencies in columns m and m+1:

                       Columns
Row        1 and 2     2 and 3     3 and 4

1           0.735       0.631       0.577
2           0.401       0.752       0.532
3           0.424       0.382       0.705
4           0.609       0.456       0.283


The order in which the ratios run is different for each pair of columns, 
and it is accordingly impossible to make the table isotropic. The dis- 
tribution of signs of association in the several tetrads is — 

+ — + 

- + - 

- - + 

The distribution is a curious one, the associations in tetrads round the 
diagonal of the whole table being so markedly positive, and those in the 
immediately adjacent tetrads equally markedly negative. Neglecting the 
other signs, this is the effect that would be produced by taking an isotropic 
distribution and then increasing the frequencies in the diagonal compart- 
ments by a sufficient percentage. Comparison of the given table with 
others from the same source shows that the peculiarity is common to the 
great majority of the tables, and accordingly its origin demands explana- 
tion. Were such a table treated by the method of the contingency 
coefficient, or a similar summary method, alone, the peculiarity might not 
be remarked. 
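
Both statements made about Table 3.4, the pattern of signs and the impossibility of rendering the table isotropic, can be checked by direct enumeration, since a 4 x 4 table admits only 24 x 24 arrangements of its rows and columns. The Python sketch below is a brute-force illustration of this; the function names are assumptions of ours, and the search is practicable only for small tables.

```python
from itertools import permutations

# Table 3.4: rows are the son's eye-colour, columns the father's.
TABLE_3_4 = [
    [194, 70, 41, 30],
    [83, 124, 41, 36],
    [25, 34, 55, 23],
    [56, 36, 43, 109],
]

def tetrad_signs(table):
    """Sign of the association in every elementary tetrad."""
    signs = []
    for r in range(len(table) - 1):
        row_signs = []
        for c in range(len(table[0]) - 1):
            cross = (table[r][c] * table[r + 1][c + 1]
                     - table[r][c + 1] * table[r + 1][c])
            row_signs.append("+" if cross > 0 else "-" if cross < 0 else "0")
        signs.append(row_signs)
    return signs

def is_isotropic(table):
    """True when every elementary tetrad has the same sign."""
    return len({sign for row in tetrad_signs(table) for sign in row}) == 1

def can_be_made_isotropic(table):
    """Try every ordering of the rows and of the columns."""
    return any(
        is_isotropic([[table[r][c] for c in cols] for r in rows])
        for rows in permutations(range(len(table)))
        for cols in permutations(range(len(table[0])))
    )

for line in tetrad_signs(TABLE_3_4):
    print(" ".join(line))
print(can_be_made_isotropic(TABLE_3_4))  # False
```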

Complete independence in contingency tables 

3.18 It may be noted that in the case of complete independence the
distribution of frequency in every row is similar to the distribution in the
row of totals, and the distribution in every column similar to that in the
column of totals; for in, say, the column A_n the frequencies are given by
the relations:

$$ (A_n B_1) = \frac{(A_n)}{N}(B_1), \quad (A_n B_2) = \frac{(A_n)}{N}(B_2), \quad (A_n B_3) = \frac{(A_n)}{N}(B_3) $$

and so on. This property is of special importance in the theory of variables.

Homogeneous and heterogeneous classification

3.19 The classifications both of this and of the preceding chapters
have one important characteristic in common, viz. that they are, so to
speak, "homogeneous," the principle of division being the same for all
the sub-classes of any one class. Thus A's and α's are both subdivided
into B's and β's; A_1's, A_2's, . . . A_s's into B_1's, B_2's, . . . B_t's, and
so on. Clearly this is necessary in order to render possible those
comparisons on which the discussions of associations and contingencies depend.
If we only know that amongst the A's there is a certain percentage of B's,
and amongst the α's a certain percentage of C's, there are no data for any
conclusion.

Many classifications are, however, essentially of a heterogeneous
character, e.g. biological classifications into orders, genera and species;
the classifications of the causes of death in vital statistics and of
occupations in the census. To take the last case as an illustration, the 1931
census of England and Wales divides occupations into 32 classes. Some
of these are not further subdivided, e.g. Fishermen. Others are
subdivided into further general classes; e.g. Class 1 is divided into (1)
Employers, (2) Furnacemen, (3) Foundry Workers, (4) Smiths, (5) Metal
Machinists, (6) Fitters and (7) Other Workers. These sub-heads are
necessarily peculiar to the class under which they occur and their number
is arbitrary and variable, and different for each main heading; but so long
as the classification remains purely heterogeneous, however complex it may
become, there is no opportunity for any discussion of causation within the
limits of the matter so derived. It is only when a homogeneous division
is in some way introduced that we can begin to speak of associations and
contingencies.

3.20 This may be done in various ways according to the nature of
the case. Thus the relative frequencies of different botanical families,
genera or species may be discussed in connection with the topographical
characters of their habitats (desert, marsh or heath) and we may observe
statistical associations between given genera and situations of a given
topographical type. The causes of death may be classified according to sex,
or age, or occupation, and it then becomes possible to discuss the
association of a given cause of death with one or other of the two sexes, with a
given age-group or with a given occupation. Again, the classifications of
deaths and of occupations are repeated at successive intervals of time; and
if they have remained strictly the same, it is also possible to discuss the
association of a given occupation or a given cause of death with the earlier
or later year of observation, i.e. to see whether the numbers of those
engaged in the given occupation or succumbing to the given cause of death
have increased or decreased. But in such circumstances the greatest
care must be taken to see that the necessary condition as to the identity of
the classifications at the two periods is fulfilled, and unfortunately it very
seldom is fulfilled. All practical schemes of classification are subject to
alteration and improvement from time to time, and these alterations,
however desirable in themselves, render a certain number of comparisons
impossible. Even where a classification has remained verbally the same,
it is not necessarily really the same; thus in the case of the causes of death,
improved methods of diagnosis may transfer many deaths from one heading
to another without any change in the incidence of the disease, and so bring
about a virtual change in the classification. In any case, heterogeneous
classification should be regarded only as a partial process, incomplete until
a homogeneous division is introduced either directly or indirectly, e.g. by
repetition.

Manifold classification as a series of dichotomies 

3.21 From a theoretical point of view, manifold classification can be
regarded as compounded of a series of dichotomies. Take, for example, a
case we have already considered, that of the classification of a population of 
men according to the eye-colours blue, grey, brown and green. We could 
have produced this fourfold division by three dichotomies. In fact, 
dividing the population first into those with blue eyes and those with not- 
blue eyes we get two classes. Then dividing again into those with brown 
eyes and those with not-brown eyes we get four classes. This operation on 
the class of blue-eyed men, however, results in one zero class, because there 
are no men with blue eyes which are at the same time brown, and one class 
which is, in fact, the class of blue-eyed men. Virtually, therefore, we have 
three classes : those with blue eyes, those with brown eyes, and the re- 
mainder. If we now dichotomise each of these into those with grey eyes 
and those with not-grey eyes, we shall again get, neglecting the zero classes, 
the four classes of the manifold classification. 

3.22 It follows from this that any manifold classification can be regarded 
as produced by a succession of divisions in which, at each stage, each 
individual could fall into one of two alternatives, A or not-A.

Put in another way, this means that the possible answers to an un- 
ambiguous question can be reduced to a succession of answers of either 
"yes" or "no." For instance, suppose the question is, "How old are you,
in years?" We can replace this question by the succession of questions,
"Are you one year old?" "Are you two years old?" . . . "Are you
120 years old?" An answer of "47" to the first-mentioned question can
then be expressed as an answer of "No" to the first 46 of these questions,
"Yes" to the 47th and "No" to the rest.

Similarly, an answer to the question, " What is your name ? " can be 
reduced to the questions, "Is the first letter of your name A?" "Is the
first letter B?" . . . "Is the second letter A?" and so on. Replies to
a more general question can be reduced to the same form by a convenient
classification ; e.g. the replies to the question, "Are you in favour of war?"
can be classified in the four forms "Favourable without qualification,"
"Favourable with some qualification," "Unfavourable without qualification,"
"Unfavourable with some qualification," and the answers to the
questions can be reduced to answers "yes" or "no" to the questions, "Are
you, without qualification, in favour of war?" and so on.
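The reduction of the age question to a run of "yes" and "no" answers can be sketched in a few lines of Python. This is only an illustration: the function names and the upper limit of 120 questions are our own assumptions, taken from the example in the text.

```python
def age_to_answers(age, max_age=120):
    """Answer "Are you k years old?" for k = 1..max_age with True (yes) or False (no)."""
    if not 1 <= age <= max_age:
        raise ValueError("age lies outside the range of questions asked")
    return [k == age for k in range(1, max_age + 1)]


def answers_to_age(answers):
    """Recover the age as the position of the single "yes" in the list."""
    return answers.index(True) + 1


answers = age_to_answers(47)
# "No" to the first 46 questions, "Yes" to the 47th, "No" to the rest.
print(answers[:46].count(False), answers[46], answers[47:].count(False))
print(answers_to_age(answers))
```

No information is lost in the reduction, since the original answer can be recovered from the list of dichotomous answers.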

Recording classified information on punched cards

3.23 The information about an individual, considered as a member
of a population, is information whether he does or does not fall into the 
alternative classes which, as we have just seen, compose the most general 
homogeneous classification of the population. If we imagine each indi- 
vidual filling in a questionnaire about himself, the totality of answers may, 
by suitably expressing the questions, be expressed as a number of "yes's"
and "no's," and these replies express all the information about the
individual. 

This simple fact allows us to record the data in a most convenient way. 
Each individual is allotted a card, which is divided into a number of cells. 
Each cell corresponds to one of the dichotomies or simple questions the 
answers to which constitute the information. If the answer is "Yes," a 
hole is punched in the cell ; if the answer is " No," the cell is left un- 
touched. 

The card of any individual will thus be like a complicated bus ticket,
with holes punched in various places. The punching is usually performed 
either by hand with a ticket collector's punch, or with a machine similar 
in principle to the typewriter. The totality of punched cards forms a 
miniature of our population — each individual has a card on which is 
recorded the whole of the information about him.

The use of this system lies in the fact that punched cards are easily 
handled and sorted by machinery. If, for example, we want to know a 
particular class-frequency, we can adjust certain electrical, pneumatic or 
mechanical stops, and the machine will segregate all the cards in the class 
and count them for us. 

3.24 A similar device has been applied to the sorting of data by hand. 
A card is prepared with a row of circular holes punched all the way round, 
near its edge but so that no hole is open to the edge. Each hole corre- 
sponds to a dichotomy or a simple question. When preparing the card, if 
the individual falls into the A class, or the answer to the question is " Yes," 
a piece is clipped out of the card so that the hole is now open to the edge. 
If the individual falls into the not-A class, or the answer to the question is 
" No," the hole is left alone. 

To separate the A's from the not-A's, or the " yes " cards from the 
" no " cards, they are arranged in a vertical plane so that corresponding 
cells are similarly placed. A skewer is then inserted in the appropriate 
hole and lifted. The not-A cards are lifted out, whilst the A cards fall 
away, since the piece of card between the hole and the edge has been cut 
away. By repeating the operation with the skewer in the appropriate 
holes we can isolate the cards in any given class. These can then be 
counted and the size of the class-frequency determined. 
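The skewer procedure just described can be imitated in Python as a rough sketch. The representation of a card as a set of notched holes, and the hole names used below, are hypothetical and serve only to illustrate the principle.

```python
def needle_sort(cards, hole):
    """Pass a skewer through `hole`: notched (A) cards fall, the not-A cards are lifted."""
    fallen = [card for card in cards if hole in card["notched"]]
    lifted = [card for card in cards if hole not in card["notched"]]
    return fallen, lifted


def class_frequency(cards, yes_holes):
    """Count the cards answering "yes" at every hole in yes_holes."""
    remaining = cards
    for hole in yes_holes:
        # Keep only the cards that fall away at this hole.
        remaining, _ = needle_sort(remaining, hole)
    return len(remaining)


# Four hypothetical cards; a hole is notched when the answer is "Yes".
cards = [
    {"id": 1, "notched": {"blue_eyes", "fair_hair"}},
    {"id": 2, "notched": {"blue_eyes"}},
    {"id": 3, "notched": {"fair_hair"}},
    {"id": 4, "notched": set()},
]

fallen, lifted = needle_sort(cards, "blue_eyes")
print([card["id"] for card in fallen], [card["id"] for card in lifted])
print(class_frequency(cards, ["blue_eyes", "fair_hair"]))
```

Each pass of the skewer is one dichotomy, so a class defined by several attributes is isolated by repeating the pass once for each attribute.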

The labour of punching cards and the expense of machinery is 
justified only when the number of individuals is large and the number of
ultimate classes is also large. This arises, for example, in the taking of
a census of population. 

Numerically defined attributes 

3^ The attributes we have instanced in the foregoing pages have 
usually been of a qualitative kind. The methods described are, however, 
applicable to data classified on a numerical basis. Consider, for example, 
the following table — 


TABLE 3.5 — Families deficient in room space
Their number in 95 crowded London wards
(Census of 1931, Housing Report, p. xxxii)

Families             Standard room requirement (rooms)
deficient by         2       3       4       5     6    7    8    Totals

1 room          12,999  18,198   7,724   2,170   164   19   ..    41,274
2 rooms             ..   3,054   4,479   1,448   221   15    1     9,218
3 rooms             ..      ..     310     508   106    4    1       929
4 rooms             ..      ..      ..      10    21    4   ..        35

Totals          12,999  21,252  12,513   4,136   512   42    2    51,456


The distinction between successive rows and columns is not quite of the 
kind of Table 3.2. In the latter, for instance, we drew a line between black 
hair and brown, a line which could be drawn by anybody who was not 
colour-blind, although there may be border-line cases of mixed colours 
which would present difficulty. But in Table 3.5 above the line is drawn 
by counting — a much more precise operation. Moreover, the rows and 
columns have a certain natural order given by the numerical sequence. 
It would seem absurd to put the column which is headed "two rooms"
between those headed "three rooms" and "four rooms," but in Table 3.2
there is no a priori reason for putting "black" between "brown" and
"red."

3.27 We might also have a contingency table in which the attributes 
were measurable quantities, and the rows and columns of the table determined
by ranges of those quantities. This, again, is slightly different
from the case of the previous paragraph, for these ranges are to a large 
extent arbitrary, whereas in Table 3.5 the indivisible nature of the room 
compels us to count in units of at least one room. 

3.28 Finally, we may have a table which is given by one qualitative 
attribute and one quantitative attribute. Consider, for example, the 
following —

TABLE 3.6 — Weight and mentality in a selection of criminals

(Data from M. H. Whiting, "On the Association of Temperature, Pulse and Respiration with Physique and
Intelligence in Criminals," Biometrika, 1912, 11, 1)

                              Weight (lb)
Mentality    90-120  120-130  130-140  140-150  150 upward   Totals

Normal           21       51       94      106         124      396
Weak             15       18       34       15          15       97

Totals           36       69      128      121         139      493


3.29 The methods of the previous chapters are applicable also to such 
tables. Numerically measurable quantities may, however, be treated by 
other methods, to which we shall come in due course. We mention the 
point here in order to remove any possible idea that the theory of attributes 
is concerned solely with qualitative classification, and is not appropriate 
to the more precise data given by a numerically assessable attribute. 


SUMMARY 

1. The division of a population according to an attribute into a number 
of heads is called manifold classification. This is an extension of the idea 
of dichotomy, in which the population is divided into two parts only. 

2. Manifold classification according to two attributes A and B gives 
rise to a contingency table. 

3. Association in a contingency table may be examined by reducing it 
in a number of ways to a 2 x 2 table. 

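4. We define

$$\delta_{mn} = (A_m B_n) - (A_m B_n)_0$$

The "square contingency" is given by

$$\chi^2 = \sum \left\{ \frac{\delta_{mn}^2}{(A_m B_n)_0} \right\}$$

The "mean-square contingency" by

$$\phi^2 = \frac{\chi^2}{N}$$

5. Pearson's "coefficient of mean-square contingency" is defined by

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$

6. Tschuprow's "coefficient of contingency" is defined by

$$T = \sqrt{\frac{\phi^2}{\sqrt{(s-1)(t-1)}}}$$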
7. Certain types of table, known as isotropic contingency tables, possess 
special features of some importance. 

8. Any manifold classification may be regarded as a succession of 
dichotomies. This fact is the basis of the use of punched cards for record- 
ing and analysing statistical data. 

9. Manifold classification may arise not only from an attribute which 
is specified under heads of a qualitative kind, but also from a quantitative 
attribute specified by counting or measurement. 
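As a rough computational sketch of the quantities named in this summary, the following Python function works out the square contingency, the mean-square contingency, Pearson's C and Tschuprow's T for a table with s rows and t columns. We assume the usual definitions (independence values obtained from the marginal totals) and that no marginal total is zero. The two small tables are hypothetical and are chosen only because their results are obvious.

```python
from math import sqrt


def contingency_coefficients(table):
    """Return (chi2, phi2, C, T) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # independence value
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n
    c = sqrt(phi2 / (1 + phi2))
    s, t = len(table), len(table[0])
    t_coef = sqrt(phi2 / sqrt((s - 1) * (t - 1)))
    return chi2, phi2, c, t_coef


independent = [[10, 20], [30, 60]]  # hypothetical: rows exactly proportional
perfect = [[50, 0], [0, 50]]        # hypothetical: complete association

print(contingency_coefficients(independent))
print(contingency_coefficients(perfect))
```

For the proportional table all four quantities vanish. For the completely associated 2 x 2 table T reaches unity, while C stops short of it.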


EXERCISES 

3.1 (Data from Karl Pearson, " On the Inheritance of the Mental and 
Moral Characters in Man," Jour. of the Anthrop. Inst., vol. 33, and
Biometrika, vol. 3.) Find the coefficient of contingency (coefficient of
mean-square contingency) for the two tables below, showing the resem- 
blance between brothers for athletic capacity and between sisters for 
temper. Show that neither table is even remotely isotropic. (As stated 
in 3.11, the coefficient of contingency should not as a rule be used for 
tables smaller than 5 x 5-fold : these small tables are given to illustrate 
the method, while avoiding lengthy arithmetic.) 


A. Athletic capacity

                            First Brother
Second Brother    Athletic   Betwixt   Non-athletic    Total

Athletic               906        20            140     1066
Betwixt                 20        76              9      105
Non-athletic           140         9            370      519

Total                 1066       105            519     1690


B. Temper

                            First Sister
Second Sister     Quick   Good-natured   Sullen    Total

Quick               198            177       77      452
Good-natured        177            996      165     1338
Sullen               77            165      120      362

Total               452           1338      362     2152


3.2 Calculate T and C for the following table, and trace the association 
between the progress of building and the urban character of the district — 


Houses in England and Wales

(Census of 1901, Summary Table X, 000's omitted)

                                Inhabited   Uninhabited   Building   Total

Adm. County of London                 571            40          5     616
Other urban districts                4064           285         45    4394
Rural districts                      1625           124         12    1761

Total for England and Wales          6260           449         62    6771


3.3 Show that for a given s and t, C and T are equal for two values of
$\phi^2$, one of which is zero ; that for $\phi^2$ between these values C > T ; and
that for $\phi^2$ greater than the higher value T > C.

3.4 Find whether the following contingency table is isotropic, and if it 
is not, ascertain whether it can be arranged in an isotropic form — 



              A1     A2     A3     A4     A5    Totals

B1            90     43     17     27     16       193
B2           235     88     44     60     40       467
B3           300    103     54     71     48       576

Totals       625    234    115    158    104      1236

3.5 Calculate C and T for the table of the previous example.

3.6 Show that in a positively isotropic contingency table, 


’ > — and is > 

(A,B,), 

3.7 1,000 subjects of English, French, German, Italian and Spanish 
nationality were asked to name their preferences among the music of those 
five nationalities. The results were as follows (l=English, 2=French, 
3=German, 4=Italian, 5=Spanish) —

Nationality          Nationality of music preferred
of subject        1      2      3      4      5    Totals

1                32     16     75     47     30       200
2                10     67     42     41     40       200
3                12     23    107     36     22       200
4                16     20     44     76     44       200
5                 8     53     30     43     66       200

Totals           78    179    298    243    202      1000


Discuss the association between the nationality of the subject and the 
nationality of the music preferred. 

3.8 In Table 3.6 calculate C and T, and discuss the light thrown by this 
table on the association between physique and intelligence in the criminals 
of the data. 

3.9 Show that for a 2x2 contingency table in which the frequencies are 

and 

V* - 

(a +^) (c (a +e) 

and hence find C and T in terms of a, b, c, d.

3.10 In a paper discussing whether laterality of hand is associated
with laterality of eye (measured by astigmatism, acuity of vision,
etc.) T. L. Woo obtained the following results (Biometrika, vol. 20A,
pp. 79-148) —

Manual laterality             Ocular laterality for general astigmatism
as determined by a
balancing test            "Left-eyed"   Ambiocular   "Right-eyed"    Totals

Left-handed                        34           62             28       124
Ambidextrous                       27           28             20        75
Right-handed                       57          105             52       214

Totals                            118          195            100       413


Show that laterality of eye is only slightly associated with laterality of 
hand. 








CHAPTER FOUR 


FREQUENCY-DISTRIBUTIONS 


Variables 

4.1 As we emphasised at the close of the last chapter, the methods 
of the theory of attributes are applicable to all observations, whether 
qualitative or quantitative. We have now to proceed to the consideration 
of special processes adapted to the treatment of quantitative data, but 
not as a rule available for the discussion of purely qualitative observations 
(though there are some important exceptions to this statement, as suggested
in 1.2).

A measurable quantity which can vary from one individual to another 
is called a variable,* and this section of our work may be termed the theory
of variables. 

As common examples of variables which are subject to statistical 
treatment we may cite birth- and death-rates, prices, wages, barometer 
readings, rainfall records, and measurements or enumerations (e.g. of 
glands, spines or petals) on animals or plants. 

Quantities which can take any numerical value within a certain range 
are called continuous variables. Such, for example, are birth-rates and 
barometric readings. Quantities which can take only discrete values 
are called discontinuous variables. This class, for instance, would include 
data of the number of petals on flowers or the number of rooms in a house. 

Frequency-distributions 

4.2 If some hundreds or thousands of values of a variable have been 
noted merely in the arbitrary order in which they occur, the mind cannot 
properly grasp the significance of the record. We must condense the 
data by some method of ranking or classification before their characteristics 
can be comprehended. 

One way of doing this would be to dichotomise the data by classifying 
the individuals as A's or not-A's, according as the value of the variable
exceeded or fell short of some given value. But this is too crude, and 
the sacrifice of information is too great. A manifold classification, 
however, avoids the crudity of the dichotomous form, since the classes
may be made as numerous as we please. Moreover, numerical measurements
lend themselves with peculiar readiness to a manifold classification,
for the class limits can be conveniently and precisely defined by assigned
values of the variable.

* It is also called a variate. We shall use the two terms as synonymous.

4.3 For convenience, the values of the variable chosen to define the
successive classes should be equidistant, so that the numbers of observations
in different classes are comparable.

The interval chosen for classifying is called the class-interval, and the 
frequency in a particular class-interval is called a class-frequency. 

Thus, for measurements of stature, the class-interval might be 1 inch,
or 2 centimetres, and the class-frequencies would be the numbers of indi- 
viduals whose statures fell within each successive inch or each successive 
2 centimetres of the scale ; returns of birth- or death-rates might be 
grouped to the nearest unit per thousand of the population ; returns of 
wages might be classified to the nearest shilling, or, if it is desired to obtain 
a more condensed table, to the nearest five or ten shillings. Discon- 
tinuous variables to a great extent determine their own class-intervals, 
which must either be equal in width to the unit amount of variation, or 
equal to some multiple of it. For example, in enumerations of the 
number of rooms in a house we naturally take our class-interval to be 
one room ; in enumerations of the petals on a flower we may take one
petal or, if the range of variation is very great, say five petals or more. 

4.4 The manner in which the class-frequencies are distributed over 
the class-intervals is spoken of as the frequency-distribution of the variable. 

A few illustrations will make clearer the nature of such frequency- 
distributions, and the service which they render in summarising a long 
and complex record. 


TABLE 4.1 — Showing the number of local government areas in England with specified
birth-rates per thousand of population

(Material from the Registrar-General's Statistical Review of England and Wales for 1933)

               Number of districts                  Number of districts
Birth-rate     with birth-rate       Birth-rate     with birth-rate
               between limits                       between limits
               stated                               stated

 1.5- 2.5              1             13.5-14.5            271
 2.5- 3.5              2             14.5-15.5            190
 3.5- 4.5              2             15.5-16.5            127
 4.5- 5.5              3             16.5-17.5             89
 5.5- 6.5              7             17.5-18.5             78
 6.5- 7.5              9             18.5-19.5             37
 7.5- 8.5             14             19.5-20.5             21
 8.5- 9.5             41             20.5-21.5             17
 9.5-10.5             83             21.5-22.5              4
10.5-11.5            131             22.5-23.5              4
11.5-12.5            192             23.5-24.5              2
12.5-13.5            242
                                     Total               1567

(a) Table 4.1. In this illustration the birth-rates per thousand of
the population in 1933 of 1,567 local government areas of England have
been classified to the nearest unit ; i.e. the number of districts has been
counted in which the birth-rate was between 1.5 per thousand and 2.5,
between 2.5 and 3.5, and so on. The frequency-distribution is shown by
the table.

Although a glance through the original returns, which are spread amongst 
many other figures over 42 pages, fails to convey any definite impression, 
a brief inspection of the above table brings out a number of important
points. Thus, we see that the birth-rates range, in round numbers, from
2 to 24 per thousand ; that the birth-rates in some 75 per cent of the
districts lie within the narrow limits 10.5 to 16.5, the rates most frequent
being near 14 ; and so on. It may be remarked that some of the areas 
are very small, with no more than 10 or 20 births, and these account 
mainly for the extremely divergent rates. 

(b) Table 4.2. The numbers of stigmatic rays on a number of Shirley
poppies were counted. As the range of variation is not great, the unit
is taken as the class-interval. The frequency-distribution is given by
the following table — 


TABLE 4.2 — Showing the frequencies of seed capsules on certain Shirley poppies with
different numbers of stigmatic rays

(Cited from G. Udny Yule, Biometrika, 1902, 2, 89)

               Number of capsules                  Number of capsules
Number of      with said number      Number of     with said number
stigmatic      of stigmatic rays     stigmatic     of stigmatic rays
rays                                 rays

    6                  3                 14               302
    7                 11                 15               234
    8                 38                 16               128
    9                106                 17                50
   10                152                 18                19
   11                238                 19                 3
   12                305                 20                 1
   13                315
                                     Total               1905




The numbers of rays range from 6 to 20, the most usual numbers being
12, 13 or 14.

(c) Table 4.3. 206 screws were taken as they came off the lathe which
was turning them. Their lengths, which should have been 1 inch, were
measured. The following table shows the screws classified by the number
of thousandths of an inch by which they exceeded or fell short of 1 inch
in length —

TABLE 4.3 — Showing the frequencies of screws classified according to the extent to
which they varied in length from the standard of 1 inch

Difference in length                  Difference in length
from 1 inch              Number of    from 1 inch              Number of
(Thousandths of an       screws       (Thousandths of an       screws
inch)                                 inch)

   -6 to -5                  1           +1 to +2                 34
   -5 to -4                  4           +2 to +3                 25
   -4 to -3                 11           +3 to +4                 16
   -3 to -2                 22           +4 to +5                  8
   -2 to -1                 25           +5 to +6                  1
   -1 to  0                 27
    0 to +1                 32
                                      Total                      206


It will be seen that the maximum frequency, i.e. 34, occurs for screws 
from 0.001 to 0.002 inch in excess of the standard. About 80 per cent
lie in the range three-thousandths of an inch on either side of the standard. 
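The two remarks just made on Table 4.3 are easily checked. The following is a small Python sketch, with the class limits and frequencies copied from the table; the variable names are our own.

```python
# Lower class limits (thousandths of an inch) and frequencies, copied from Table 4.3.
lower_limits = list(range(-6, 6))
frequencies = [1, 4, 11, 22, 25, 27, 32, 34, 25, 16, 8, 1]

total = sum(frequencies)

# The class with the maximum frequency.
modal_index = frequencies.index(max(frequencies))
modal_class = (lower_limits[modal_index], lower_limits[modal_index] + 1)

# Classes lying wholly within three-thousandths on either side of the standard.
within_three = sum(f for low, f in zip(lower_limits, frequencies) if -3 <= low < 3)
share = within_three / total

print(total, modal_class, within_three, round(100 * share, 1))
```

The modal class is that from +1 to +2 thousandths, and 165 of the 206 screws, or about 80 per cent, lie within three-thousandths of the standard.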

4.5 Expanding slightly the brief description we have given, tables 
setting out frequency-distributions are formed in the following way — 

(1) The magnitude of the class-interval is first fixed. In Tables 4.1, 
4.2 and 4.3 one unit was chosen. 

(2) The position or origin of the intervals must then be determined ; 
e.g. in Table 4.1 we must decide whether to take as intervals 9-10, 10-11,
11-12, etc., or 9.5-10.5, 10.5-11.5, 11.5-12.5, etc.

(3) This choice having been made, the complete scale of intervals is 
fixed and the observations are classified accordingly. 

(4) The process of classification being finished, a table is drawn up on 
the general lines of Tables 4.1-4.3, showing the total number of observations
in each class-interval.
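The four steps above may be sketched in Python as follows. The function name and the birth-rates are hypothetical, chosen only for illustration, and we assume that each interval includes its lower limit and excludes its upper limit.

```python
from collections import Counter
from math import floor


def frequency_distribution(values, width, origin):
    """Count values in the classes [origin + k*width, origin + (k+1)*width)."""
    # Steps (1) to (3): width and origin fix the scale; each value gets a class index.
    counts = Counter(floor((v - origin) / width) for v in values)
    # Step (4): draw up the table, keeping empty classes inside the range.
    table = []
    for k in range(min(counts), max(counts) + 1):
        lower = origin + k * width
        table.append((lower, lower + width, counts.get(k, 0)))
    return table


# Hypothetical birth-rates per thousand, for illustration only.
rates = [9.7, 10.2, 10.4, 10.9, 11.3, 12.8, 10.6, 11.1]
for lower, upper, freq in frequency_distribution(rates, width=1.0, origin=9.5):
    print(f"{lower:.1f}-{upper:.1f}  {freq}")
```

Taking the origin at 9.5 rather than 9 gives intervals whose mid-values are integers, as in Table 4.1.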

It is necessary to make a few remarks about each of these heads. 

Magnitude of class-interval 

4.6 As already remarked, in cases where the variation proceeds by 
discrete steps of considerable magnitude as compared with the range of 
variation, there is very little choice as regards the magnitude of the class- 
interval. The unit will in general have to serve. But if the variation 
be continuous, or at least takes place by discrete steps which are small
in comparison with the whole range of variation, there is no such natural 
class-interval, and its choice is a matter for judgment. 

The two conditions which guide the choice are these : (a) We desire 
to be able to treat all the values assigned to any one class, without serious
error, as if they were equal to the mid-value of the class-interval, e.g.
as if the birth-rate of every district in the first class of Table 4.1 were
exactly 2.0, the birth-rate of every district in the second class 3.0, and
so on ; (b) for convenience and brevity we desire to make the interval 
as large as possible, subject to the first condition. These conditions will 
generally be fulfilled if the interval be so chosen that the whole number 
of classes lies between 15 and 25. A number of classes less than, say, 
ten leads in general to very appreciable inaccuracy, and a number over, 
say, thirty makes a somewhat unwieldy table. A preliminary inspection 
of the record should accordingly be made and the highest and lowest 
values be picked out. Dividing the difference between these by, say, 
twenty-five, we have an approximate value for the interval. The actual 
value should be the nearest integer or simple fraction. 
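The rule of thumb just given can be put into a short Python sketch. The list of "simple" interval widths is our own assumption, as is the function name; the text asks only for the nearest integer or simple fraction.

```python
def suggest_class_interval(values, target_classes=25,
                           simple_steps=(0.1, 0.2, 0.25, 0.5, 1, 2, 5, 10)):
    """Divide the range by target_classes and round to the nearest simple step."""
    raw = (max(values) - min(values)) / target_classes
    # The candidate steps are an assumed list of convenient widths.
    step = min(simple_steps, key=lambda s: abs(s - raw))
    return raw, step


# Extreme class limits of Table 4.1, standing in for the highest and lowest values.
raw, step = suggest_class_interval([1.5, 24.5])
classes = (24.5 - 1.5) / step
print(round(raw, 2), step, classes)
```

With a range of 23 units the rough value is 0.92, so the unit is taken as the interval. This gives 23 classes, which lies within the limits of 15 and 25 suggested above.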

Position of intervals 

4.7 The position or starting-point of the intervals is, as a rule, more or 
less a matter of indifference. It can therefore be chosen as is most 
convenient for the particular case under discussion, e.g. so that the limits 
of the intervals are integers, or, as in Table 4.1, so that the mid-values are 
integers. It may also be chosen so that no limits correspond exactly
to any recorded value of the variate, in order to obviate any difficulty 
in deciding to which class a particular individual should be assigned 
(cf. 4.9). 

The location of the intervals is, however, important when the values 
of the variate tend for some reason to cluster round particular values. 
Such a case arises, for instance, in age returns, owing to the tendency 
to state a round number where the true age is unknown, or a reluctance
to admit one's real age.* It is also common wherever there is some
doubt as to the final digit in reading a scale, and scope is given to the 
idiosyncrasies of the observer. 

Table 4.4 shows results for four observers as illustrations, the frequencies 
being reduced for comparability to a total of 1,000. Column A is based
on measures by G. U. Yule, on drawings, to the nearest tenth of a milli- 
metre. It is recognised, of course, that measures cannot really be made to 
such a degree of precision ; but the measurer believed that he was making 
them carefully, and as they were made with a Zeiss scale, in which the 
divisions are ruled on the under side of a piece of plate-glass, readings 
were unaffected by parallax. Nevertheless, it will be seen that the 
zeros, and also 2, 8 and 9, were heavily over-emphasised — an odd selection 
of preferences! On the whole, the centre of the millimetre was neglected 
and measures piled up at the two ends. 

The data for columns B, C and D are all drawn from the same published
report, and refer to sundry head measurements taken on the living subject.
On the basis of a statement in the introduction to the report, it was possible
to compile the data separately for the three assistants (B, C, D) who had
done the actual measuring. It will be seen that B was rather good : there
is a relatively slight excess at 0 and 5, but otherwise his measurements are
fairly uniformly distributed. C was decidedly not good, rounding off nearly
one measurement in two to the nearest centimetre or half-centimetre. D
was simply outrageously bad — so bad that it might have been better not
to publish his measurements. Nearly 57 per cent of his measurements
were made only to the nearest centimetre or half-centimetre — a quite
inadequate degree of precision for head measurements often only a few
centimetres in magnitude.

* This effect is practically the same for men as for women. Cf. Table I in the Appendix
to the paper cited in the heading to Table 4.4.


TABLE 4.4 — Frequency-distributions of final digits in measurements by four observers

(G. U. Yule, "On Reading a Scale," J. Roy. Stat. Soc., 1927, 90, 570)

                    Frequency of final digit per 1,000 for observer
Final digit              A         B         C         D

0                      158       122       251       358
1                       97        98        37        49
2                      125        98        80        90
3                       73        90        72        63
4                       76       100        55        37
5                       71       112       222       211
6                       90        98        71        62
7                       56        99        75        70
8                      126       101        72        44
9                      129        81        65        16

Total                 1001       999      1000      1000

Actual observations   1258      3000      1000      1000

When there is any possibility of clustering of variate values it is as 
well to subject the data to a close examination before finally fixing on 
the method of classification. On the whole, the intervals should be 
arranged as far as possible so that the values round which the clustering 
occurs fall towards the interval mid-values. This procedure avoids 
sensible error in the assumption that the interval mid-value is approxi- 
mately representative of the values of the class. 
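A close examination of this kind may begin with the proportion of readings ending in favoured digits. The following Python sketch applies it to the figures of Table 4.4; the function name is our own, and under uniform reading any two digits would claim about 20 per cent of the readings.

```python
# Frequencies per 1,000 of final digits 0 to 9, copied from Table 4.4.
per_thousand = {
    "A": [158, 97, 125, 73, 76, 71, 90, 56, 126, 129],
    "B": [122, 98, 98, 90, 100, 112, 98, 99, 101, 81],
    "C": [251, 37, 80, 72, 55, 222, 71, 75, 72, 65],
    "D": [358, 49, 90, 63, 37, 211, 62, 70, 44, 16],
}


def share_of_digits(freq, digits=(0, 5)):
    """Proportion of readings whose final digit is one of `digits`."""
    return sum(freq[d] for d in digits) / sum(freq)


for observer, freq in per_thousand.items():
    print(observer, round(100 * share_of_digits(freq), 1))
```

Observers C and D show about 47 and 57 per cent of readings at 0 or 5, against some 23 per cent for A and B. The preference of A for 2, 8 and 9 can be examined in the same way by changing the `digits` argument.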

Classification 

4.8 The scale of intervals having been fixed, the observations may 
be classified. If the number of observations is not large, it will be sufficient 
to mark the limits of successive intervals in a column down the left-hand 
side of a sheet of paper, and transfer the entries of the original record
to this sheet by marking a 1 on the line corresponding to any class for
each entry assigned thereto. It saves time in subsequent totalling if
each fifth entry in a class is marked by a diagonal across the preceding
four, or by leaving a space. 

The disadvantage in this process is that it offers no facilities for checking ; 
if a repetition of the classification leads to a different result, there is no 
means of tracing the error. If the number of observations is at all con- 
siderable and accuracy is essential, it is accordingly better to enter the 
values observed on cards, one to each observation. These are then 
dealt out into packs according to their classes and the whole work checked 
by running through the pack corresponding to each class, and verifying
that no cards have been wrongly sorted. 

4.9 In some cases difficulties may arise in classifying, owing to the 
occurrence of observed values corresponding to class-limits. Thus, in 
compiling Table 4.1 some districts will have been noted with birth-rates 
entered in the Registrar-General's returns as 16.5, 17.5 or 18.5, any one
of which might at first sight have been apparently assigned indifferently
to either of two adjacent classes. In such a case, however, where the
original figures for numbers of births and population are available, the
difficulty may be readily surmounted by working out the rate to another
place of decimals : if the rate stated to be 16.5 proves to be 16.502, it
will be sorted to the class 16.5-17.5 ; if 16.498, to the class 15.5-16.5.
Birth-rates that work out to half-units exactly do not occur in this example, 
and so there is no real difficulty. 

In the case of Table 4.3, again, there is little difficulty in knowing the 
class to which an individual should be assigned. 

Difficulties of this type may, in fact, always be avoided if they are
borne in mind in fixing the class-intervals, by fixing the intervals to a
further place of decimals or a smaller fraction than the values in the
original record. Thus, if statures are measured to the nearest centimetre,
the class-intervals may be taken as 150.5-151.5, 151.5-152.5, etc.; if to the
nearest eighth of an inch, the intervals may be 59 15/16-60 15/16, 60 15/16-61 15/16,
and so on. 

If the difficulty is not evaded in any of these ways, it is usual to assign 
one-half of an intermediate observation to each adjacent class, with the 
result that half-units occur in the class-frequencies (cf. Table 4.9, p. 86). 
The procedure is rough, but probably good enough for practical purposes ; 
strict precision is usually unattainable, for in point of fact the odd way in 
which different individuals read a scale, for example, renders it impossible 
to assign exact limits to intervals. 
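The rough procedure of halving an intermediate observation between the adjacent classes may be sketched in Python as follows. The function name and the values are hypothetical. An observation falling on one of the two extreme limits of the scale, which has only one class to receive it, is here counted wholly to that class; that is our own assumption.

```python
def classify_with_halves(values, limits):
    """Class-frequencies for ascending `limits`; a value on an inner limit is halved."""
    freq = [0.0] * (len(limits) - 1)
    for v in values:
        if not limits[0] <= v <= limits[-1]:
            raise ValueError(f"{v} lies outside the scale of intervals")
        for k in range(len(freq)):
            lower, upper = limits[k], limits[k + 1]
            if lower < v < upper:
                freq[k] += 1.0
            elif v in (lower, upper):
                # On a limit: half here, the other half to the neighbouring class.
                on_outer_edge = v in (limits[0], limits[-1])
                freq[k] += 1.0 if on_outer_edge else 0.5
    return freq


# Hypothetical rates, three of which fall exactly on class-limits.
limits = [15.5, 16.5, 17.5, 18.5]
values = [16.0, 16.5, 17.2, 17.5, 17.5, 18.1]
print(classify_with_halves(values, limits))
```

The total frequency is unchanged by the halving, but half-units now appear in the class-frequencies.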

Tabulation 

4.10 As regards the actual drafting of the final table there is little 
to be said, except that care should be taken to express the class-limits 
clearly and, if necessary, to say how the difficulty of intermediate values
has been met or evaded. The class-limits are perhaps best given as in
Tables 4.1 and 4.3, but may be more briefly indicated by the mid-values of
the class-intervals. Thus, Table 4.1 might have been given in the form —


Birth-rate per 1,000 to Number of districts with 

the nearest unit said birth-rate 

2 1 

3 2 

4 2 

etc. etc. 

It is also permissible to write the table in the form — 
Interval Frequency 


it being understood that the closing point of any interval is the starting 
point of the following interval. Cf. Table 4.11 below. 

It should be noticed that the method of defining class-intervals adopted 
in Table 4.3 leaves the class-limits uncertain unless the degree of accuracy 
of the measurements is also given. Thus, in a table giving frequencies of 
men in certain height-ranges of 1 inch in width, say "57 and less than 58,"
etc., if measurements were taken to the nearest eighth of an inch, the class-
limits are really 56 15/16-57 15/16, 57 15/16-58 15/16, etc.; if they were only taken to
the nearest quarter of an inch, the limits are 56 7/8-57 7/8, 57 7/8-58 7/8, etc. With
such a form of tabulation a statement as to the number of significant figures 
in the original record is therefore essential. It is better, perhaps, to state 
the true class-limits and avoid ambiguity. 


4.11 The rule that class-intervals should be all equal is one that is 
very frequently broken in official statistical publications, principally in 
order to condense an otherwise unwieldy table, thus not only saving space 
in printing but also considerable expense in compilation, or possibly, in the 
case of confidential figures, to avoid giving a class which would contain 
only one or two observations, the identity of which might be guessed. It 
would hardly be legitimate, for example, to give a return of incomes relating 
to a limited district in such a form that the income of the two or three 
wealthiest men in the district would be clear to any intelligent reader with 
local knowledge. 

If the class-intervals be made unequal, the application of many statistical 
methods is rendered awkward, or even impossible. Further, the 
relative values of the frequencies are misleading, so that the table is not 
perspicuous. Thus, consider the first two columns of Table 4.5, showing 
the number of persons liable to sur-tax and super-tax classified according 
to their annual income. On running the eye down the column headed 
"Number of Persons," the attention is at once caught by the three 
irregularities at the classes "£3,000 and not exceeding £4,000," "£8,000 and 
not exceeding £10,000," and "£10,000 and not exceeding £15,000." But 
these have no real significance ; they are merely due to changes in the 
magnitude of the class-interval at those points. A further change occurs 
at the £30,000 and at the £50,000 mark, although the attention is not 
directed thereto by any marked irregularity in the frequencies. 

TABLE 4.5. The numbers of persons in the United Kingdom liable to sur-tax and 
super-tax in the year beginning 5th April 1931 

Classified according to the magnitudes of their annual incomes 
(From the Statistical Abstract for the United Kingdom for the Years 1913 and 1919-32, Cmd. 4489) 

Annual income                        Number of     Frequency per
(£000)                               persons       £500 interval

  2   and not exceeding   2.5         23,988          23,988
  2.5 and not exceeding   3           15,781          15,781
  3   and not exceeding   4           17,979           8,989
  4   and not exceeding   5            9,755           4,877
  5   and not exceeding   6            5,921           2,960
  6   and not exceeding   7            3,729           1,864
  7   and not exceeding   8            2,546           1,273
  8   and not exceeding  10            3,193             798
 10   and not exceeding  15            3,616             362
 15   and not exceeding  20            1,328             133
 20   and not exceeding  25              679              68
 25   and not exceeding  30              378              38
 30   and not exceeding  40              372              19
 40   and not exceeding  50              192              10
 50   and not exceeding  75              182               4
 75   and not exceeding 100               57               1
100   and over                            94               ?

Total number of persons               89,790



To make the class-frequencies really comparable inter se they must first 
be reduced to a common interval as basis, say £500, by dividing the third 
and subsequent numbers by 2, the eighth by 4, and so on. This gives 
the mean frequencies tabulated in the third column of Table 4.5. The 
reduction is, however, impossible in the case of the last class, for we are 
told only the number of persons with an income of £100,000 and upwards. 
Such an indefinite class is in many respects a great inconvenience, and 
should always be avoided in work not subjected to the necessary limitations 
of official publications. 
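
The reduction to a common interval can be reproduced as follows. This is a minimal sketch using the first two columns of Table 4.5; the names `CLASSES`, `BASE_WIDTH` and `frequency_per_base_interval` are illustrative only, and the open class "100 and over" is left out because, as noted above, it has no definite width.

```python
# (lower limit, upper limit, number of persons); limits in thousands of pounds.
CLASSES = [
    (2, 2.5, 23988), (2.5, 3, 15781), (3, 4, 17979), (4, 5, 9755),
    (5, 6, 5921), (6, 7, 3729), (7, 8, 2546), (8, 10, 3193),
    (10, 15, 3616), (15, 20, 1328), (20, 25, 679), (25, 30, 378),
    (30, 40, 372), (40, 50, 192), (50, 75, 182), (75, 100, 57),
]
BASE_WIDTH = 0.5  # the common interval of 500 pounds, in thousands


def frequency_per_base_interval(lower, upper, count, base=BASE_WIDTH):
    """Mean frequency per base interval: the class-frequency divided by the
    number of base intervals that the class spans."""
    return count / ((upper - lower) / base)


for lower, upper, count in CLASSES:
    mean = frequency_per_base_interval(lower, upper, count)
    print(f"{lower:>5} to {upper:<5} {count:>7,} {mean:>10,.1f}")
```

The third column of Table 4.5 gives these quotients as whole numbers, so the printed decimals differ from it by less than one unit.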

4.12 The general rule that intervals should be equal must not be held 
to bar the analysis by smaller equal intervals of some portion of the range 
over which the frequency varies very rapidly. In Table 4.11, page 89, 
for example, giving the numbers of deaths from scarlet fever at successive 
ages, it is desirable to give the numbers of deaths in each year for the first 
five years, so as to bring out the rapid rise to the maximum in the third 
year of life. 

Graphical representation : frequency-polygon and histogram 

4.13 It is often convenient to represent the frequency-distribution 
by means of a diagram which conveys to the eye the general run of the 
observations. The following short table, giving the distribution of 
head-breadths for 1,000 men, will serve as an example: 

TABLE 4.6. Showing the frequency-distribution of head-breadths for students at 
Cambridge 

Measurements taken to the nearest tenth of an inch 
(Cited from W. R. Macdonell, Biometrika, 1902, 1, 220) 

Head-breadth    Number of men with     Head-breadth    Number of men with
in inches       said head-breadth      in inches       said head-breadth

    5.5                  3                 6.3                 99
    5.6                 12                 6.4                 37
    5.7                 43                 6.5                 15
    5.8                 80                 6.6                 12
    5.9                131                 6.7                  3
    6.0                236                 6.8                  2
    6.1                185
    6.2                142                Total             1,000


Taking a piece of squared paper ruled, say, in inches and tenths, mark 
off along a horizontal base-line a scale representing class-intervals ; a 
half-inch to the class-interval would be suitable. Then choose a vertical 
scale for the class-frequencies, say 50 observations per interval to the inch, 
and mark off, on the verticals or ordinates through the points marked 5.5, 
5.6, 5.7, . . . at the centres of the class-intervals on the base-line, heights 
representing on this scale the class-frequencies 3, 12, 43, . . . The diagram 
may then be completed in one of two ways : (1) as a frequency-polygon, 
by joining up the marks on the verticals by straight lines, the last points at 
each end being joined down to the base at the centre of the next class- 
interval (fig. 4.1) ; or (2) as a column diagram or histogram, short 
horizontals being drawn through the marks on the verticals (fig. 4.2), which 
now form the central axes of a series of rectangles representing the class- 
frequencies. 
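
The construction just described can be sketched numerically. The code below is an illustration under stated assumptions: it uses the scales suggested in the text (half an inch of paper per class-interval, 50 observations per interval to the inch), measures horizontal distance from the centre of the empty class just below 5.5, and uses hypothetical function names.

```python
# Class-frequencies of Table 4.6, for class-centres 5.5, 5.6, ..., 6.8 inches.
FREQS = [3, 12, 43, 80, 131, 236, 185, 142, 99, 37, 15, 12, 3, 2]

INCHES_PER_INTERVAL = 0.5   # horizontal scale: half an inch of paper per interval
OBS_PER_INCH = 50           # vertical scale: 50 observations per interval to the inch


def polygon_vertices(freqs):
    """Paper coordinates (inches) of the frequency-polygon, including the two
    end points joined down to the base at the centre of the next class-interval."""
    heights = [0] + list(freqs) + [0]
    return [(i * INCHES_PER_INTERVAL, f / OBS_PER_INCH) for i, f in enumerate(heights)]


def histogram_rectangles(freqs):
    """(left, right, height) of each histogram rectangle, in inches of paper."""
    half = INCHES_PER_INTERVAL / 2
    rects = []
    for i, f in enumerate(freqs, start=1):
        centre = i * INCHES_PER_INTERVAL
        rects.append((centre - half, centre + half, f / OBS_PER_INCH))
    return rects


vertices = polygon_vertices(FREQS)
rectangles = histogram_rectangles(FREQS)
print(vertices[:3])     # first three corners of the polygon
print(rectangles[:2])   # first two rectangles of the histogram
```

Both figures share the same ordinates; they differ only in how the marks on the verticals are joined.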

Fig. 4.1. Frequency-polygon for head-breadths of 1,000 Cambridge students 
(Table 4.6). Horizontal axis: head-breadth in inches. 

Fig. 4.2. Histogram for the same data as fig. 4.1. Horizontal axis: 
head-breadth in inches. 

4.14 The student should note that in any such diagram, of either form, 
a certain area represents a given number of observations. On the scales 
suggested, 1 inch on the horizontal represents 2 intervals, and 1 inch 
on the vertical represents 50 observations per interval : 1 square inch 
therefore represents 50 x 2 = 100 observations. The diagrams are, however, 
conventional : in both cases the whole area of the figure is proportional 
to the total number of observations, but the area over every 
interval is not correct in the case of the frequency-polygon, and the 
frequency of every fraction of any interval is not the same, as suggested 
by the histogram. The area shown by the frequency-polygon over any 
interval with an ordinate $y_2$ (fig. 4.3) is only correct if the tops of the 
three successive ordinates $y_1$, $y_2$, $y_3$ lie on a line, i.e. if 
$y_2 = \tfrac{1}{2}(y_1 + y_3)$, the areas of 
the two little triangles shaded in the figure being equal. If $y_2$ fall short of 
this value, the area shown by the polygon is too great ; if $y_2$ exceed it, 
the area shown by the polygon is too small ; and if, for this reason, the 
frequency-polygon tends to become very misleading at any part of the 
range, it is better to use the histogram. 
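
The statement about the polygon can be checked by computing the area it shows over a single interval. The sketch below takes the class-width as unity and splits the area over the interval into two trapezia, one on each side of the central ordinate; the function name is hypothetical and the data are those of Table 4.6.

```python
FREQS = [3, 12, 43, 80, 131, 236, 185, 142, 99, 37, 15, 12, 3, 2]


def polygon_area_over_interval(y1, y2, y3):
    """Number of observations the frequency-polygon shows over the interval
    whose ordinate is y2, the neighbouring ordinates being y1 and y3."""
    left_edge = (y1 + y2) / 2     # height of the polygon at the lower class-limit
    right_edge = (y2 + y3) / 2    # height of the polygon at the upper class-limit
    # two trapezia, each half a class-interval wide
    return 0.5 * (left_edge + y2) / 2 + 0.5 * (y2 + right_edge) / 2


# Two zeros at each end cover the empty classes into which the polygon runs
# when it is joined down to the base.
padded = [0, 0] + FREQS + [0, 0]
shown = [polygon_area_over_interval(padded[i - 1], padded[i], padded[i + 1])
         for i in range(1, len(padded) - 1)]

print(polygon_area_over_interval(131, 236, 185))   # interval centred at 6.0
print(sum(shown))                                  # whole area of the polygon
```

For the class centred at 6.0 the polygon shows 216.5 observations against the true 236, because that ordinate exceeds the mean of its neighbours; the whole area nevertheless comes to the full 1,000, in agreement with the text.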

4.15 The histogram may also be used when the class-intervals are 
unequal. The construction of the previous section is easily adapted to 
such cases. All that is necessary is to describe an area equal, on the scale 
adopted, to the frequency in a particular interval ; this is done, as before, 
by erecting at the centre of the interval an ordinate equal in length to 
the total frequency divided by the width of the interval. 

An example of this kind of construction is given in fig. 4.11 (Table 
4.11). The frequencies of deaths for ages over 5 years are given in 5-yearly 
periods, whereas those for ages under 5 years are given in 1-yearly periods. 
On the scale indicated, therefore, the height of the cell of the histogram 
corresponding to the ages 2-3 years is 89, the class-frequency ; that of the 
cell corresponding to the ages 5-10 is 42.6, i.e. 213 divided by 5. Hence the 
areas of the two cells are, to the scale adopted, 89 and 213, respectively, 
so that the areas accurately represent the frequencies. 
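
The rule for unequal intervals may be put as a short sketch: the height of each cell is the class-frequency divided by the class-width, so that width multiplied by height recovers the frequency. Only the two classes quoted in the text are used, and `histogram_cell` is an illustrative name.

```python
# (lower age, upper age, number of deaths) for the two classes quoted from Table 4.11.
CLASSES = [(2, 3, 89), (5, 10, 213)]


def histogram_cell(lower, upper, frequency):
    """Centre, width and height of the histogram cell for one class."""
    width = upper - lower
    return {
        "centre": (lower + upper) / 2,
        "width": width,
        "height": frequency / width,   # frequency per unit of the variable
    }


for lower, upper, deaths in CLASSES:
    cell = histogram_cell(lower, upper, deaths)
    area = cell["width"] * cell["height"]
    print(f"ages {lower}-{upper}: height {cell['height']}, area {area:.1f}")
```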

Frequency-curves 

4.16 If the class-intervals be made smaller, and at the same time the 
number of observations increased so that the class-frequencies may 
remain finite, the polygon and the histogram will approach more and 
more closely to a smooth curve. Such an ideal limit to the polygon or 
the histogram is called a frequency-curve. It is a concept of supreme 
importance in statistical theory. 

In the frequency-curve the area between any two ordinates whatever 
is proportional to the number of observations falling between the 
corresponding values of the variable. Thus, the number of observations 
falling between the values $x_1$ and $x_2$ of the variable in fig. 4.4 will be 
proportional to the area of the shaded strip in the figure ; the number of 
observed values greater than $x_2$ will be given by the area of the curve to 
the right of the ordinate at $x_2$ ; and so on. 

Fig. 4.4 
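
In symbols the area property may be written as below. This is offered only as a sketch: $f(x)$ for the height of the frequency-curve and $N$ for the total number of observations are symbols introduced here for illustration, with the curve assumed to be scaled so that its whole area is unity, and $x_1 < x_2$ are the two values of the variable between which observations are counted.

```latex
\text{observations between } x_1 \text{ and } x_2 \;=\; N \int_{x_1}^{x_2} f(x)\,dx ,
\qquad
\text{observations greater than } x_2 \;=\; N \int_{x_2}^{\infty} f(x)\,dx .
```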

4.17 When we come to consider the theory of sampling we shall regard 
the frequency curve as representing a population from which the actual 
data are a specimen. The frequency-polygon and the histogram will then 
be approximations to the curve, but will diverge from it to some extent 
owing to fluctuations of sampling. For the present we must defer a closer 
inquiry into this subject. We may remark, however, that when the 
number of observations is considerable — say a thousand at least — the 
run of the class-frequencies is usually sufficiently smooth to give a good 
notion of the form of the "ideal" distribution. 

Some common types of frequency-distribution 

4.18 The forms presented by smoothly running sets of data are almost 
endless in their variety, but among them we may notice a comparatively 
small number of simple types. Such types also form a set into which 
more complex distributions may often be analysed. For elementary 
purposes it is sufficient to consider four fundamental simple types, which 
we shall call the symmetrical distribution, the moderately asymmetrical 
or skew distribution,¹ the extremely asymmetrical or J-shaped distribution 
and the U-shaped distribution. In the following sections we give some 
examples of each of these types, together with a few more complex 
distributions. 

Fig. 4.5. An ideal symmetrical frequency-distribution 

The symmetrical distribution 

4.19 In this type the class-frequencies decrease to zero symmetrically 
on either side of a central maximum. Fig. 4.5 illustrates the ideal form 
of the distribution. 

Being a special case of the more general type described under the 
second heading, this form of distribution is comparatively rare. It 
occurs in the case of biometric, more especially anthropometric, measurements, 
from which the following illustration is drawn, and is important 
in much theoretical work. Table 4.7 shows the frequency-distribution of 
statures for adult males born in the British Isles, from data published by a 
British Association Committee in 1883, the figures being given separately 
for persons born in England, Scotland, Wales and Ireland, and totalled 
in the last column. These frequency-distributions are approximately of 
the symmetrical type. The frequency-polygon for the totals given by 
the last column of the table is shown in fig. 4.6. The student will notice 
that an error of 1/16 inch, scarcely appreciable in the diagram on its reduced 
scale, is neglected in the scale shown on the base-line, the intervals being 
treated as if they were 57-58, 58-59, etc. Diagrams should be drawn for 
comparison showing, to a good open scale, the separate distributions for 
England, Scotland, Wales and Ireland. 

TABLE 4.7. The frequency-distributions of statures for adult males born in England, 
Scotland, Wales and Ireland 

As measurements are stated to have been taken to the nearest 1/8th of an inch, the 
class-intervals are here presumably 56 15/16 to 57 15/16, 57 15/16 to 58 15/16, 
and so on (cf. 4.9). 
(See fig. 4.6.) 

(Final Report of the Anthropometric Committee to the British Association, Report, 1883, p. 256.) 

Height without     Number of men within said limits of height
shoes, inches      Place of birth:
                   England   Scotland     Wales   Ireland     Total

     57-                 1          -         1         -         2
     58-                 3          1         -         -         4
     59-                12          -         1         1        14
     60-                39          2         -         -        41
     61-                70          2         9         2        83
     62-               128          9        30         2       169
     63-               320         19        48         7       394
     64-               524         47        83        15       669
     65-               740        109       108        33       990
     66-               881        139       145        58     1,223
     67-               918        210       128        73     1,329
     68-               886        210        72        62     1,230
     69-               753        218        52        40     1,063
     70-               473        115        33        25       646
     71-               254        102        21        15       392
     72-               117         69         6        10       202
     73-                48         26         2         3        79
     74-                16         15         1         -        32
     75-                 9          6         1         -        16
     76-                 1          4         -         -         5
     77-                 1          1         -         -         2

     Total           6,194      1,304       741       346     8,585

¹ These two types, from their shape, are frequently referred to as "humped," 
"cocked hat," "single peaked," and so on. 



Fig. 4.6. Frequency-distribution of stature for 8,585 adult males born in the British 
Isles (Table 4.7) 


The moderately asymmetrical (skew) distribution 

4.20 In this case the class-frequencies decrease with markedly greater 
rapidity on one side of the maximum than on the other, as in fig. 4.7 (a) 
or (b). This is the most common of all smooth forms of frequency-distribution, 
illustrations occurring in statistics from almost every source. 
The distribution of birth-rates given in Table 4.1 is slightly asymmetrical. 

Fig. 4.7. Ideal distributions of the moderately asymmetrical form, (a) and (b) 

The distribution of Australian marriages given in Table 4.8 (fig. 4.8) 
is rather more asymmetrical and is of the type (a) of fig. 4.7. The 
frequency attains its maximum for ages between 24 and 27 and then 
tails off slowly. We have not drawn the tail of the curve, which is very 
close to the x-axis, for values of the variate above 58.5. 

Table 4.9 and fig. 4.9 give a biological illustration, viz. the distribution 
of fecundity (ratio of yearling foals produced to coverings) in mares. 

TABLE 4.8. Numbers of marriages contracted in Australia, 1907-14 

Arranged according to the age of bridegroom in 3-year groups 
(From S. J. Pretorius, "Skew Bivariate Frequency Surfaces," Biometrika, 1930, 22, 210) (See fig. 4.8) 

Age of bridegroom                        Age of bridegroom
(Central value of 3-year   Number of     (Central value of 3-year   Number of
range, in years)           marriages     range, in years)           marriages

        16.5                    294              55.5                  1,655
        19.5                 10,995              58.5                  1,100
        22.5                 61,001              61.5                    810
        25.5                 73,054              64.5                    649
        28.5                 56,501              67.5                    487
        31.5                 33,478              70.5                    326
        34.5                 20,569              73.5                    211
        37.5                 14,281              76.5                    119
        40.5                  9,320              79.5                     73
        43.5                  6,236              82.5                     27
        46.5                  4,770              85.5                     14
        49.5                  3,620              88.5                      5
        52.5                  2,190
                                                 Total               301,785

The student should notice the difficulty of classification in this case : 
the class-interval chosen throughout the middle of the range is 1/15th, 
but the last interval is "29/30-1." This is not a whole interval, but it 
is more than a half, for all the cases of complete fecundity are reckoned 
into the class. In the diagram (fig. 4.9) it has been reckoned as a whole 
class, and this gives a smooth distribution. 

To take an illustration from meteorology, the distribution of barometer 
heights at any one station over a period of time is, in general, asymmetrical, 
the most frequent heights lying towards the upper end of the range for 
stations in England and Wales. Table 4.10 and fig. 4.10 show the dis- 
tribution for daily observations at Greenwich during the years 1848-1926 
inclusive. 

The distributions of Tables 4.8-4.10 all follow more or less the type 
of fig. 4.7 (a), the frequency tailing off, at the steeper end of the distribution, 
in such a way as to suggest that the ideal curve is tangential to the 
base. Cases of greater asymmetry, suggesting an ideal curve that meets 
the base (at one end) at a finite angle, even a right angle, as in fig. 4.7 (b), 
are less frequent, but occur occasionally. The distribution of deaths 
from scarlet fever, according to age, affords one such example of a more 
asymmetrical kind. The actual figures for this case are given in Table 
4.11 and illustrated by fig. 4.11 ; and it will be seen that the frequency 
of deaths reaches a maximum for children aged "2 and under 3," the 
number rising very rapidly to the maximum, and thence falling so slowly 
that there is still an appreciable frequency for persons over 50 years of 
age. 

TABLE 4.9. The frequency-distribution of fecundity, i.e. the ratio of the number of 
yearling foals produced to the number of coverings, for brood-mares (racehorses) 
covered eight times at least 

(See fig. 4.9) 

(Pearson, Lee and Moore, Phil. Trans., A, 1899, 192, 303) 

                 Number of mares                     Number of mares
                 with fecundity                      with fecundity
Fecundity        between the        Fecundity        between the
                 given limits                        given limits

 1/30- 3/30           2             17/30-19/30          315
 3/30- 5/30           7.5           19/30-21/30          337
 5/30- 7/30          11.5           21/30-23/30          293.5
 7/30- 9/30          21.5           23/30-25/30          204
 9/30-11/30          55             25/30-27/30          127
11/30-13/30         104.5           27/30-29/30           49
13/30-15/30         182             29/30-1               19
15/30-17/30         271.5
                                    Total              2,000.0

Fig. 4.9. Frequency-distribution of fecundity for brood-mares (Table 4.9). 
Horizontal axis: ratio of yearling foals produced to coverings. 

Asymmetrical curves are also said to be "skew." In Chapter 7 we 
shall consider skewness at some length and discuss various ways of 
measuring it. In particular we shall find that skewness has a sign, and 
we may explain at this stage that the skewness is said to be positive if 
the longer tail of the curve lies to the right, or negative if it lies to the 
left ; e.g. the curve of fig. 4.8 has positive skewness, whilst those of figs. 4.9 
and 4.10 have negative skewness. 
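
The sign convention can be tried on the data of Tables 4.8 and 4.9. The sketch below uses the sign of the third moment about the mean as an indicator of the direction of the longer tail; this is an assumption made for illustration, since the text reserves the measurement of skewness for Chapter 7. Each class is represented by its central value, the last fecundity class (29/30-1) being placed at 29.5/30.

```python
def third_central_moment(values, freqs):
    """Third moment about the mean for grouped data (class mid-values and
    frequencies); positive when the longer tail lies to the right."""
    total = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / total
    return sum(f * (v - mean) ** 3 for v, f in zip(values, freqs)) / total


# Table 4.8: central ages of the 3-year groups and numbers of marriages.
AGES = [16.5 + 3 * i for i in range(25)]
MARRIAGES = [294, 10995, 61001, 73054, 56501, 33478, 20569, 14281, 9320,
             6236, 4770, 3620, 2190, 1655, 1100, 810, 649, 487, 326, 211,
             119, 73, 27, 14, 5]

# Table 4.9: class mid-values in thirtieths and numbers of mares.
FECUNDITY = [2 * k for k in range(1, 15)] + [29.5]
MARES = [2, 7.5, 11.5, 21.5, 55, 104.5, 182, 271.5, 315, 337, 293.5,
         204, 127, 49, 19]

for name, values, freqs in (("marriages", AGES, MARRIAGES),
                            ("fecundity", FECUNDITY, MARES)):
    sign = "positive" if third_central_moment(values, freqs) > 0 else "negative"
    print(f"{name}: {sign} skewness")
```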

The extremely asymmetrical, or J-shaped, distribution 

4.21 In this type the class-frequencies run up to a maximum at one end 
of the range, as in fig. 4.12. 

This may be regarded as a limiting form of the previous distribution, 
and, in fact, the two cannot always be distinguished by elementary methods 
if the original data are not available. If, for instance, the frequencies of 
Table 4.11 had been given by five-year intervals only, they would have run 
322, 213, 70, 27, etc., thus suggesting that the maximum number of deaths 
occurred at the beginning of life, i.e. that the distribution was J-shaped. 
It is only the analysis of deaths in the earlier years by one-year intervals 
which shows that the frequencies reach a maximum in the third year and 
that therefore the distribution is of the moderately asymmetrical type. 
In practical cases no hard-and-fast rule can be drawn between the moderately 
and extremely asymmetrical types, any more than between the 
asymmetrical and the symmetrical types. 

TABLE 4.10. Barometric heights at Greenwich on alternate days from 1848 to 1926 

(See fig. 4.10) 

(Data from S. J. Pretorius, "Skew Bivariate Frequency Surfaces," Biometrika, 1930, 22, 154) 

Fig. 4.10. Barometric height at Greenwich on alternate days from 1848-1926 
(Table 4.10). Horizontal axis: barometric height (inches). 

TABLE 4.11. The number of deaths from scarlet fever at different ages in England 
and Wales in 1933 

(See fig. 4.11) 

(Data from Registrar-General's Statistical Review of England and Wales for 1933, Tables, Part I, Medical) 

Age in years     Number of deaths     Number per year

     0-                 16                 16
     1-                 69                 69
     2-                 89                 89
     3-                 74                 74
     4-                 74                 74
     5-                213                 42.6
    10-                 70                 14.0
    15-                 27                  5.4
    20-                 26                  5.2
    25-                 17                  3.4
    30-                 12                  2.4
    35-                 11                  2.2
    40-                 10                  2.0
    45-                  6                  1.2
    50-                  7                  1.4
    55-                  5                  1.0
    60-                  -                  -
    65-                  1                  0.2
    70-                  1                  0.2
    75-                  1                  0.2
    80-                  -                  -

    Total              729                  -
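
The remark in 4.21, that five-year grouping of Table 4.11 would hide the true maximum, can be verified directly. The following is a minimal sketch with illustrative names; the empty classes of the table are entered as zero.

```python
# Table 4.11: one-year classes for ages 0-4, then five-year classes from age 5.
FIRST_FIVE_YEARS = [16, 69, 89, 74, 74]
LATER_FIVE_YEAR_CLASSES = [213, 70, 27, 26, 17, 12, 11, 10, 6, 7, 5, 0, 1, 1, 1, 0]


def index_of_maximum(freqs):
    """Position of the largest frequency in the list."""
    return max(range(len(freqs)), key=lambda i: freqs[i])


# Five-year grouping throughout: the first five years collapse into one class.
coarse = [sum(FIRST_FIVE_YEARS)] + LATER_FIVE_YEAR_CLASSES

# Deaths per year of age: the one-year classes as they stand, the rest divided by 5.
per_year = FIRST_FIVE_YEARS + [f / 5 for f in LATER_FIVE_YEAR_CLASSES]

print(coarse[:4])                    # the run quoted in the text
print(index_of_maximum(coarse))      # 0: the maximum appears to lie at the start
print(index_of_maximum(per_year))    # 2: the class "2 and under 3"
```

With the coarse grouping the largest class is the first, which suggests a J-shape; on the per-year basis the maximum falls in the third year of life.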


4.22 In economic statistics this form of distribution is particularly 
characteristic of the distribution of wealth in the population at large, as 
illustrated by income tax and house valuation returns, and the curve to 
which it gives rise has been called the "Pareto line," after Vilfredo Pareto 
who directed the attention of economists to it. 

Such distributions may, of course, be a very extreme case of the last 
type. It is difficult to say. But if the maximum is not absolutely at the 
lower end of the range, it is very close thereto. 

Official returns do not usually give the necessary analysis of the 
frequencies at the lower end of the range to enable the exact position of the 
maximum to be determined ; and for this reason the data on which Table 
4.12 is founded, though of course very unreliable, are of some interest. It 
will be seen from the table and fig. 4.13 that with the given classification 
the distribution appears clearly assignable to the present type, the number 
of estates between zero and £100 in annual value being more than six times 
as great as the number between £100 and £200 in annual value, and the 
frequency continuously falling as the value increases. A close analysis of 
the first class suggests, however, that the greatest frequency does not occur 
actually at zero, but that there is a true maximum frequency for estates of 
about £1 15/- in annual value. The distribution might therefore be more 
correctly assigned to the second type, but the position of the greatest 
frequency indicates a degree of skewness which is high even compared 
with the skewness of fig. 4.11. 

The type is more frequent in other classes of material than was at one 
time thought. Distributions of deaths of centenarians afford an example, 
and so, curiously enough, do deaths of infants unless the class-interval 
is exceedingly fine — a matter of hours. The distribution may be obtained 
by compiling the frequencies of the numbers of genera with 1, 2, 3, . . . 
species in any biological group. Table 4.13 shows such a distribution for 
the Chrysomelid beetles. Yule has also shown that it is characteristic 
of the numbers of words used once, twice, thrice, etc., in a given work 
and has used it in investigations into literary vocabularies. 

The U-shaped distribution 

4.23 This type exhibits a maximum frequency at the ends of the range 
and a minimum towards the centre, as in fig. 4.14. 

This is a rare but interesting form of distribution, as it stands in somewhat 
marked contrast to the preceding forms. Table 4.14 and fig. 4.15 
illustrate an example based on a considerable number of observations, viz. 
the distribution of degrees of cloudiness, or estimated percentage of the sky 
covered by cloud, at Greenwich in July. 

For the purposes of the illustration we regard cloudiness as a variate 
varying from complete overcastness to clear sky, the range being divided 
into eleven equal parts. 

It will be seen that a sky completely or almost completely overcast at 
the time of observation is the most common, a practically clear sky comes 
next, and the intermediates are more rare. 

The remarks we made about the extreme end of the J-shaped distribution 
also apply to the U-shaped distribution. In particular cases it 
may be that the grouping is too coarse to reveal the true character of the 
frequency at the maxima, and if the data were more complete we might 
discover that the two arms of the U in fact were bent over. 

Fig. 4.12. An ideal distribution of the extremely asymmetrical form 

Truncated forms 

4.24 The four types we have been considering sometimes occur in an 
incomplete form. Certain limitations on the range of the variate may 
result in a kind of truncation at one end or the other. Consider, for 
example, Table 4.15, p. 96. In obtaining these figures, twelve dice were 
thrown and the occurrence of a 6 was called a success. At one throw there 
could thus be any number of successes from 0 to 12. The dice were thrown 
4,096 times. 

Fig. 4.13. Frequency-distribution of the annual values of certain estates in England 
in 1715 ; 2,476 estates (Table 4.12). Horizontal axis: annual value in £100. 

Fig. 4.16 gives the frequency-polygon for this distribution. We can 
picture it as a slightly skew distribution which has been cut off on the left 
owing to the inadmissibility of negative values of the variate. Discontinuous 
variates not infrequently give rise to this effect of truncation. 

Complex distributions 

4.25 Table 4.16 gives the number of male deaths within certain age-limits 
for England and Wales in the years 1930-32. 

The histogram for these data is given in fig. 4.17. It will be seen that 
the distribution has three maxima, one for each of the 0-5, the 20-25 and 
the 70-75 age-groups. 

Without looking too closely into this mortality curve we can see that 
the high frequency at the beginning is undoubtedly due to the heavy 
infantile death-rate. We can, if we choose, regard the distribution as 
made up by the superposition of three others : a J-shaped distribution 
for the lower years, a small one-humped distribution with its maximum 
about the period 20-25 years, and a skew distribution for the higher 
ages. This is an example of the fact we have already mentioned, that 
a complex distribution can sometimes be analysed into simpler types. 
In this particular case the analysis is likely to be of real service in actuarial 
work and in investigations into the causes of death. 

Fig. 4.15. Cloudiness at Greenwich in July ; 1,715 observations (Table 4.14) 

TABLE 4.12. The numbers and annual values of the estates of those who had taken 
part in the Jacobite rising of 1715 

(See fig. 4.13) 

(Compiled from Cosin's "Names of the Roman Catholics, Nonjurors, and others who Refused to take the Oaths to his 
late Majesty King George, etc." ; London, 1745. Figures of very doubtful absolute value. See a note in Southey's 
"Commonplace Book," vol. 1, p. 573, quoted from the Memoirs of T. Hollis) 

Annual value     Number of      Annual value     Number of
in £100          estates        in £100          estates

   0- 1           1726.5          17-18              1
   1- 2            280              -                -
   2- 3            140.5          20-21              4
   3- 4             87            21-22              1
   4- 5             46.5          22-23              1
   5- 6             42.5          23-24              1
   6- 7             29.5            -                -
   7- 8             25.5          27-28              2
   8- 9             18.5            -                -
   9-10             21            31-32              1
  10-11             11.5            -                -
  11-12              9.5          39-40              1
  12-13              4              -                -
  13-14              3.5          45-46              1
  14-15              8              -                -
  15-16              3            48-49              1
  16-17              5
                                  Total          2,476
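
The three maxima mentioned in 4.25 can be located mechanically from Table 4.16. The sketch below treats a class as a maximum when its frequency exceeds that of each neighbouring class; the function name is illustrative, and the open class "100 and over" (48 deaths) is left out because it is not a five-year interval.

```python
# Table 4.16: male deaths in five-year age-groups 0-5, 5-10, ..., 95-100.
LOWER_AGES = list(range(0, 100, 5))
DEATHS = [97290, 11532, 7305, 13062, 16741, 16126, 15673, 18345, 23778,
          33158, 43812, 56639, 68103, 80690, 84041, 72180, 45094, 19913,
          5145, 767]


def local_maxima(freqs):
    """Indices of classes whose frequency exceeds that of both neighbours
    (an end class needs only to exceed its single neighbour)."""
    peaks = []
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else float("-inf")
        right = freqs[i + 1] if i < len(freqs) - 1 else float("-inf")
        if f > left and f > right:
            peaks.append(i)
    return peaks


for i in local_maxima(DEATHS):
    print(f"maximum in the {LOWER_AGES[i]}-{LOWER_AGES[i] + 5} age-group: {DEATHS[i]:,}")
```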

4.26 Finally, we give an example of a pseudo-frequency-distribution 
of a type occasionally resorted to when the data can be classified according 
to a characteristic which, though not strictly speaking measurable, can 
nevertheless be graduated in an ordered sequence. Such a case arises 
fairly often in psychological work. 

A list of 100 words was read out to each of 11 subjects. Subsequently, 
at 15-minute intervals, four fresh lists were read out which contained 25 
of the words in the original and 25 new words, the four taken together 
accounting for the whole of the original 100. The subject had to say 
whether these individual words were in the original list or not, and to 
state whether he was certain, fairly sure, doubtful but inclined one way 
or the other, or merely doubtful. The various phases of belief were 
then allotted numbers, and ran from -3 (certainty that a word was not 
in the original) through 0 (doubt, without inclination one way or the other) 
to +3 (certainty that a word was in the original). The tabulation on p. 97 
sets out the results for words in the original list (data reproduced by 
permission from the records of the Department of Psychology, University 
of St. Andrews). 


TABLE 4.13. Chrysomelids (beetles). Numbers of genera with 1, 2, 3, . . . species 

(Compiled by Dr. J. C. Willis, F.R.S. ; cited from G. U. Yule, "A Mathematical Theory of Evolution based 
on the Conclusions of Dr. J. C. Willis," Phil. Trans., B, 1924, 213, 85) 

Species   Genera      Species   Genera      Species   Genera

    1       215          32        1           74        1
    2        90          33        1           76        1
    3        38          34        1           77        1
    4        35          35        1           79        1
    5        21          36        3           83        1
    6        16          37        1           84        3
    7        15          38        1           87        2
    8        14          39        2           89        1
    9         5          40        2           92        2
   10        15          41        1           93        1
   11         8          43        4          110        1
   12         9          44        1          114        1
   13         5          45        1          115        1
   14         6          46        1          128        1
   15         8          49        2          132        1
   16         6          50        4          133        1
   17         6          52        1          146        1
   18         3          53        1          163        1
   19         4          56        1          196        1
   20         3          58        1          217        1
   21         4          59        1          227        1
   22         4          62        1          264        1
   23         5          63        3          327        1
   24         4          65        1          399        1
   25         2          66        1          417        1
   26         3          67        1          681        1
   27         1          69        1
   28         3          71        1
   29         3          72        1
   30         3          73        1


Total genera: 627 

TABLE 4.14. The frequencies of estimated intensities of cloudiness at Greenwich 
during the years 1890-1904 (excluding 1901) for the month of July 

(See fig. 4.15) 

(Data from Gertrude E. Pearse, Biometrika, 1928, 20A, 336) 

Degrees of                    Degrees of
cloudiness     Frequency      cloudiness     Frequency

    10            676              4             45
     9            148              3             68
     8             90              2             74
     7             65              1            129
     6             55              0            320
     5             45
                                 Total        1,715

TABLE 4.15. Twelve dice thrown 4,096 times, a throw of 6 points reckoned as a success 

(See fig. 4.16) 

(Weldon's data ; cited by F. Y. Edgeworth, Encyclopaedia Britannica, 11th ed., 22, 39) 

Number of successes      0      1      2      3      4      5      6    7 and over    Total

Number of throws       447  1,145  1,181    796    380    115     24        8         4,096


Fig. 4.16. Frequency-polygon of successes with dice throwing (Table 4.15). 
Horizontal axis: number of successes. 

TABLE 4.16. The number of male deaths in England and Wales for 1930-32 

Classified by ages at death 

(See fig. 4.17) 

(Data from Registrar-General's Statistical Review of England and Wales, 1933, Text) 

Age at death     Number of      Age at death      Number of
(years)          deaths         (years)           deaths

    0- 5          97,290          55- 60           56,639
    5-10          11,532          60- 65           68,103
   10-15           7,305          65- 70           80,690
   15-20          13,062          70- 75           84,041
   20-25          16,741          75- 80           72,180
   25-30          16,126          80- 85           45,094
   30-35          15,673          85- 90           19,913
   35-40          18,345          90- 95            5,145
   40-45          23,778          95-100              767
   45-50          33,158         100 and over          48
   50-55          43,812
                                  Total           729,442

Fig. 4.17. Histogram of number of deaths at various ages (Table 4.16) 


Words in the original list were classified as: 

               In                      Possibly                   Out
                                       either in
Certain   Fairly sure   Doubtful       or out       Doubtful   Fairly sure   Certain

   +3         +2           +1             0            -1          -2           -3

  540        117           63            39            63          87          191


These results are very curious, and are borne out by other data of a 
similar kind. In particular we see that there were more cases of certainty 
about something which was not true than of doubt without inclination. 
In this example we are clearly making some assumption in allotting 
numbers to various degrees of belief ; but it would be impossible to 
measure belief on a scale, and we have to do the best we can. The numbers 
attached to the variate in such cases are not measures, but convenient 
ordinals, like the numbers attached to kings of the same name. For 
this reason a frequency diagram of such data can only give a very general 
idea of their true nature. 


SUMMARY 

1. Data in which the individuals are specified by the numerical values
of a variable, or variate, may with convenience be arranged in a table
which gives the frequency lying within successive, preferably equal,
ranges of the variable. Such an arrangement is called a frequency-
distribution.

2. The frequency-distribution can be represented diagrammatically by 
means of a frequency-polygon or a histogram. 

3. The histogram is particularly appropriate to cases in which the
frequency changes rapidly or the class-intervals are not all of the same
width.

4. As the width of the class-intervals becomes smaller, the frequency- 
polygon or the histogram may be imagined to approach a smooth curve, 
which is called the frequency-curve. 

5. A large number of frequency distributions occurring in practice
fall into four types ; the symmetrical, the moderately asymmetrical or
skew, the extremely asymmetrical or J-shaped and the U-shaped types.
Certain other distributions can be analysed into constituents each of
which belongs to one of these types.


EXERCISES 

4.1 If the diagram fig. 4.6 is redrawn to scales of 300 observations per 
interval to the inch and 4 inches of stature to the inch, what is the scale 
of observations to the square inch ? 

If the scales are 100 observations per interval to the centimetre and 2 
inches of stature to the centimetre, what is the scale of observations to the 
square centimetre ? 

4.2 If fig. 4.10 is redrawn to scales of 900 days to the inch and 0.3 inch of
barometric height to the inch, what is the scale of observations to the
square inch ?

If the scales are 400 days to the centimetre and 0.1 inch of barometric
height to the centimetre, what is the scale of observations to the square
centimetre ?

4.3 If a frequency-polygon be drawn to represent the data of Table
4.1, what number of observations will the polygon show between birth-
rates of 16.5 and 17.5 per thousand, instead of the true number 89 ?

4.4 If a frequency-polygon be drawn to represent the data of Table 4.6,
what number of observations will the polygon show between head-breadths
5.95 and 6.05, instead of the true number 236 ?

4.5 Draw frequency-polygons or histograms, as the case seems to require, 
for the following distributions, and assign them to the four types we have 
enumerated in 4.18 — 

(a) Size of firms in the food, drink and tobacco trades of Great Britain

The table shows the number of firms employing on an average certain numbers
of persons.

(Final Report of the Fourth Census of Production, Part III)

Size of firm (average        Number of
numbers employed)            firms

   11-24                     2,245
   25-49                     1,449
   50-99                       771
  100-199                      439
  200-299                      164
  300-399                       75
  400-499                       36
  500-749                       54
  750-999                       31
1,000-1,499                     23
1,500 and over                  29

Total                        5,316


(b) The percentages of deaf-mutes among children of parents one of whom at least was a
deaf-mute, for marriages producing five children or more

(Compiled from material in "Marriages of the Deaf in America," ed. E. A. Fay, Volta Bureau, Washington, 1898)

Percentage of    Number of
deaf-mutes       families

  0- 20          220
 20- 40           20.5
 40- 60           12
 60- 80            5.5
 80-100           15

Total            273


(c) Yield of grain in pounds from plots of 1/500th acre in a wheat field

(Mercer and Hall, "The Experimental Error of Field Trials," Journ. Agr. Science, 4, 1911, 107)

Yield of grain in pounds
per 1/500th acre (central     Number of
value of range)               plots

2.8                              4
3.0                             15
3.2                             20
3.4                             47
3.6                             63
3.8                             78
4.0                             88
4.2                             69
4.4                             59
4.6                             35
4.8                             10
5.0                              8
5.2                              4

Total                          500

(d) The frequencies of different numbers of petals for three series of Ranunculus bulbosus

(H. de Vries, Ber. deutsch. bot. Ges., Bd. 12, 1894, q.v. for details)

Number                    Frequency
of petals     Series A    Series B    Series C

 5              312         345         133
 6               17          24          55
 7                4           7          23
 8                2           -           7
 9                2           2           2
10                -           -           2
11                -           2           -

Total           337         380         222


4.6 A number of perfectly spherical balls, all of the same material, give a 
symmetrical distribution when classified according to their diameters. 
Show that, if they are classified according to their weights, their frequency- 
distribution will be positively skew towards the higher weights. 

Table to Exercise 4.6

The frequency-distribution of weights for adult males born in England, Scotland, Wales
and Ireland (loc. cit., Table 4.7)

Weights were taken to the nearest pound, consequently the true class-intervals are
89.5-99.5, 99.5-109.5, etc.

           Number of men within given limits of weight.
           Place of birth:
Weight
in lb      England    Scotland    Wales    Ireland    Total

 90-            2         -          -        -           2
100-           26         1          2        5          34
110-          133         8         10        1         152
120-          338        22         23        7         390
130-          694        63         68       42         867
140-        1,240       173        153       57       1,623
150-        1,075       255        178       51       1,559
160-          881       275        134       36       1,326
170-          492       168        102       25         787
180-          304       125         34       13         476
190-          174        67         14        8         263
200-           75        24          7        1         107
210-           62        14          8        1          85
220-           33         7          1        -          41
230-           10         4          2        -          16
240-            9         2          -        -          11
250-            3         4          1        -           8
260-            1         -          -        -           1
270-            -         -          -        -           -
280-            -         -          1        -           1

Total       5,552     1,212        738      247       7,749

In the light of this result compare the distributions of Table 4.7 with the
distributions of the table above.

4.7 Toss a coin six times and note the number of heads. Repeat the 
experiment 100 times or more, and draw a frequency-polygon of your 
results classified according to the number of heads at each throw. 
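
Where no coin is to hand, the experiment of Exercise 4.7 can be imitated on a machine. The following is a minimal sketch only: the function name `heads_distribution`, the fixed seed and the use of a pseudo-random generator in place of a real coin are all assumptions made for illustration, and the frequencies it prints are those of one simulated run, not of any recorded experiment.

```python
import random
from collections import Counter

def heads_distribution(trials=100, tosses=6, seed=0):
    """Simulate `trials` repetitions of `tosses` coin tosses and count heads."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    counts = Counter(
        sum(rng.random() < 0.5 for _ in range(tosses)) for _ in range(trials)
    )
    # Frequencies for 0, 1, ..., `tosses` heads, empty classes included
    return [counts.get(k, 0) for k in range(tosses + 1)]

freq = heads_distribution()
for k, f in enumerate(freq):
    # A row of asterisks serves as a rough frequency diagram
    print(f"{k} heads: {f:3d} " + "*" * f)
```

The list `freq` gives the ordinates of the frequency-polygon asked for in the exercise.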

4.8 Find the frequency-distribution of 200 bars of a waltz by Strauss
classified according to the number of notes in the principal melody in the
treble clef of each bar, and compare it with a similar distribution from
modern waltzes.

4.9 Examine qualitatively the effect on the distribution of Table 4.8
of an allowance for the fact that minors tend to overstate their age when
marrying.

4.10 The distribution of a herd of cows classified according to the quantity 
of milk produced by each cow per week is symmetrical. The distribution 
of the same herd classified according to the amount of butter-fat produced 
by each cow per week is negatively skew towards the lower quantities. 
Suggest a possible explanation for this fact. 



CHAPTER FIVE 


AVERAGES AND OTHER MEASURES OF 

LOCATION 


The principal characteristics of frequency-distributions 

5.1 The condensation of data into a frequency-distribution is a first 
and necessary step in rendering a long series of observations compre- 
hensible. But for practical purposes it is not enough, particularly when 
we want to compare two or more different series. As a next step we wish 
to be able to define quantitatively the characteristics of a frequency- 
distribution in as few numbers as possible. 

5.2 It might seem at first sight that very difficult cases of comparison 
of two distributions could arise in which, for example, we had to contrast 
a symmetrical distribution with a J-shaped distribution. In practice, 
however, we rarely have to deal with such a case. Distributions drawn 
from similar material are usually of similar form — as, for instance, when 
we wish to compare the distributions of stature in two races of man, or 
the birth-rates in English registration districts in two successive decades, 
or the numbers of wealthy people in two different countries. The practical 
use of the various statistical quantities which we shall discuss in this 
and the next two chapters is based on this fact. 

5.3 There are two fundamental characteristics in which similar frequency- 
distributions may differ — 

(1) They may differ markedly in position, i.e. in the value of the variate 
round which they centre, as in fig. 5.1, A. 

(2) They may differ in the extent to which the observations are dis- 
persed about the central value. Figs. 5.1, B and C, show cases in which 
distributions differ in dispersion only, and in both dispersion and position, 
respectively. 

To these two characteristics we may add a third group of less import- 
ance, comprising differences in skewness, peakedness, and so on. 

Measures of the first character, i.e. position or location, are generally 
known as averages. Measures of the second are termed measures of 
dispersion. Measures of the properties in the third group have each 
their appropriate name, which we shall give when we come to consider 
them in detail. 

The present chapter deals only with averages. Chapter 6 deals with
measures of dispersion, whilst Chapter 7 deals with the remaining 
quantities. 

Dimensions of an average 

5.4 In whatever way an average is defined, it may be as well to note
that it is merely a certain value of the variable, and is therefore necessarily
of the same dimensions as the variable : i.e. if the variable be a length,
its average is a length ; if the variable be a percentage, its average is a
percentage ; and so on. But there are several different ways of approxi-
mately defining the position of a frequency-distribution ; that is, there
are several different forms of average, and the question therefore arises,
By what criteria are we to judge the relative merits of different forms ?
What are, in fact, the desirable properties for an average to possess ?

Fig. 5.1. Frequency-distributions (1) and (2) compared in panels A, B and C

Desiderata for a satisfactory average 

5.5 (a) In the first place, it almost goes without saying that an average
should be rigidly defined, and not left to the mere estimation of the
observer. An average that was merely estimated would depend too
largely on the observer as well as the data.

(b) An average should be based on all the observations made. If not,
it is not really a characteristic of the whole distribution.

(c) It is desirable that the average should possess some simple and
obvious properties to render its general nature readily comprehensible :
an average should not be of too abstract a mathematical character.

(d) It is, of course, desirable that an average should be calculated with
reasonable ease and rapidity. Other things being equal, the easier
calculated is the better of two forms of average. At the same time
great weight must not be attached to mere ease of calculation, to the
neglect of other factors.

(e) It is desirable that the average should be as little affected as may
be possible by what we have termed fluctuations of sampling. If different
samples be drawn from the same material, however carefully they may
be taken, the averages of the different samples will rarely be quite the
same, but one form of average may show much greater differences than
another. Of the two forms, the more stable is the better. The full
discussion of this condition must, however, be postponed to a later section
of this work (Chap. 18).

(f) Finally, by far the most important desideratum is this, that the
measure chosen shall lend itself readily to algebraical treatment. If, 
e.g., two or more series of observations on similar material are given, 
the average of the combined series should be readily expressed in terms 
of the averages of the component series ; if a variable may be expressed 
as the sum of two or more others the average of the whole should be 
readily expressed in terms of the averages of its parts. A measure for 
which simple relations of this kind cannot be readily determined is likely 
to prove of somewhat limited application. 

5.6 There are three forms of average in common use, the arithmetic
mean, the median and the mode, the first named being by far the most
widely used in general statistical work. To these may be added the
geometric mean and the harmonic mean, more rarely used, but of service
in special cases. We will consider these in the order named.

The arithmetic mean 

5.7 The arithmetic mean of a series of values of a variable $X_1$, $X_2$,
$X_3$, . . . $X_N$, N in number, is the quotient of the sum of the values by
their number. That is to say, if M be the arithmetic mean,

$$M = \frac{1}{N}(X_1 + X_2 + X_3 + \ldots + X_N)$$

The arithmetic mean is also denoted by placing a bar over the variate
symbol, so that we may also write:

$$\bar{X} = \frac{1}{N}(X_1 + X_2 + \ldots + X_N)$$

To express these formulae more briefly by the use of the summation
symbol $\Sigma$,

$$\bar{X} = M = \frac{1}{N}\Sigma(X) \tag{5.1}$$

The word mean or average alone, without qualification, is very generally
used to denote this particular form of average ; that is to say, when anyone
speaks of the mean or the average of a series of observations, it
may, as a rule, be assumed that the arithmetic mean is meant.

5.8 It is evident that the arithmetic mean fulfils the conditions laid
down in (a) and (b) of 5.5, for it is rigidly defined and based on all the
observations made. Further, it fulfils condition (c), for its general nature
is readily comprehensible. If the wages-bill for N workmen is £P, the
arithmetic mean wage, P/N pounds, is the amount that each would
receive if the whole sum available were divided equally between them :
conversely, if we are told that the mean wage is £M, we know this means
that the wages-bill is NM pounds. Similarly, if N families possess a total
of C children, the mean number of children per family is C/N, the number
that each family would possess if the children were shared uniformly.
Conversely, if the mean number of children per family is M, the total
number of children in N families is NM. The arithmetic mean expresses,
in fact, a simple relation between the whole and its parts.

The mean is also satisfactory as regards conditions (e) and (f), but we
shall have to defer proof of this statement for the present.

Calculation of the arithmetic mean

5.9 As regards condition (d), simplicity of calculation, the mean takes 
a high place. In the cases just cited, it will be noted that the mean is 
actually determined without even the necessity of determining or noting 
all the individual values of the variable : to get the mean wage we need not 
know the wages of every hand, but only the wages-bill ; to get the mean 
number of children per family we need not know the number in each 
family, but only the total. If this total is not given, but we have to deal
with a moderate number of observations (so few, say 30 or 40, that it is
hardly worth while compiling the frequency-distribution) the arithmetic
mean is calculated directly as suggested by the definition, i.e. all the values
observed are added together and the total divided by the number of
observations.

5.10 But if the number of observations be large, the process of adding
together all the values of the variate may be prohibitively lengthy. It
may be shortened considerably by forming the frequency-table and treat-
ing all the values in each class as if they were identical with the mid-value
of the class-interval, a process which in general gives an approximation
that is quite sufficiently exact for practical purposes if the class-interval
has been taken moderately small. In this process each class-frequency
is multiplied by the mid-value of the interval, the products added together,
and the total divided by the number of observations. If $f$ denote the
frequency of any class, X the mid-value of the corresponding class-interval,
the value of the mean so obtained may be written:

$$M = \frac{1}{N}\Sigma(fX) \tag{5.2}$$
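
The process of 5.10 is readily mechanised. The sketch below applies equation (5.2) to the wheat yields of Exercise 4.5 (c), taking the central values 2.8, 3.0, ... 5.2 lb as the mid-values X; the function name `grouped_mean` is introduced here purely for illustration.

```python
def grouped_mean(mid_values, frequencies):
    """Equation (5.2): M = (1/N) * sum of f * X over the classes."""
    if len(mid_values) != len(frequencies):
        raise ValueError("need exactly one frequency per class")
    n = sum(frequencies)  # total number of observations N
    return sum(f * x for x, f in zip(mid_values, frequencies)) / n

# Wheat yields of Exercise 4.5 (c): central values of the ranges, in lb
yields = [2.8 + 0.2 * k for k in range(13)]
plots = [4, 15, 20, 47, 63, 78, 88, 69, 59, 35, 10, 8, 4]

mean_yield = grouped_mean(yields, plots)
print(round(mean_yield, 3))
```

For these figures the mean comes to about 3.95 lb per plot, a value lying close to the class of greatest frequency, as would be expected in a nearly symmetrical distribution.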

5.11 But this procedure is still further abbreviated in practice by the
following artifices : (1) The class-interval is treated as the unit of measure-
ment throughout the arithmetic ; (2) the difference between the mean
and the mid-value of some arbitrarily chosen class-interval is computed
instead of the absolute value of the mean.

If A be the arbitrarily chosen value and

$$X = A + \xi \tag{5.3}$$

then

$$\Sigma(fX) = \Sigma(fA) + \Sigma(f\xi)$$

or, since A is a constant,

$$M = A + \frac{1}{N}\Sigma(f\xi) \tag{5.4}$$

The calculation of $\Sigma(fX)$ is therefore replaced by the calculation of
$\Sigma(f\xi)$. The advantage of this is that the class-frequencies need only be
multiplied by small integral numbers ; for A being the mid-value of a
class-interval, and X the mid-value of another, and the class-interval being
treated as a unit, the $\xi$'s must be a series of integers proceeding from zero
at the arbitrary origin A. To keep the values of $\xi$ as small as possible, A
should be chosen near the middle of the range.

It may be mentioned here that $\frac{1}{N}\Sigma(\xi)$, or $\frac{1}{N}\Sigma(f\xi)$ for the grouped dis-
tribution, is sometimes termed the first moment of the distribution about
the arbitrary origin A.


Example 5.1. As an example, let us find the arithmetic mean of the
heights in the "total" column of Table 4.7. In this case the class-interval
is a unit (1 inch), so the value of M - A is given directly by dividing $\Sigma(f\xi)$
by N. The student must notice that, measures having been made to the
nearest eighth of an inch, the mid-values of the intervals are 57 7/16, 58 7/16,
etc., and not 57.5, 58.5, etc.

Calculation of the arithmetic mean stature of male adults in the British Isles from the
figures of Table 4.7, p. 82

(1)         (2)           (3)                   (4)
Height,     Frequency     Deviation from        Product
inches      $f$           arbitrary value A     $f\xi$
                          $\xi$

57-             2           -10                   -20
58-             4            -9                   -36
59-            14            -8                  -112
60-            41            -7                  -287
61-            83            -6                  -498
62-           169            -5                  -845
63-           394            -4                 -1576
64-           669            -3                 -2007
65-           990            -2                 -1980
66-          1223            -1                 -1223
                                   (subtotal)   -8584
67-          1329             0                     0
68-          1230            +1                  1230
69-          1063            +2                  2126
70-           646            +3                  1938
71-           392            +4                  1568
72-           202            +5                  1010
73-            79            +6                   474
74-            32            +7                   224
75-            16            +8                   128
76-             5            +9                    45
77-             2           +10                    20

Total        8585             -                 +8763

$$\Sigma(f\xi) = +8763 - 8584 = +179$$

$$M - A = +\frac{179}{8585} = +0.02 \text{ class-intervals or inches.}$$

$$\therefore M = 67\tfrac{7}{16} + 0.02 = 67.46 \text{ inches.}$$
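
The arithmetic of Example 5.1 may be checked mechanically. The following is a sketch under the assumptions stated in the example: the class 67- is taken as arbitrary origin, its mid-value is 67 7/16 inches, and the class-interval is one inch. The variable names are illustrative only.

```python
from fractions import Fraction

# Frequencies of the "total" column for the classes 57-, 58-, ..., 77-
freq = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
        1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]

origin_index = 10                       # position of the class 67- in the list
A = Fraction(67) + Fraction(7, 16)      # mid-value of that class, in inches
width = 1                               # class-interval, in inches

# Deviations xi in class-interval units: -10, ..., 0, ..., +10
xi = [k - origin_index for k in range(len(freq))]

N = sum(freq)
sum_f_xi = sum(f * x for f, x in zip(freq, xi))

# Equation (5.4), with the class-interval restored as the unit
M = A + Fraction(sum_f_xi, N) * width
print(N, sum_f_xi, round(float(M), 2))
```

Exact fractions are used so that the only rounding occurs at the final step, as in the hand calculation.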


5.12 As calculations of the mean constantly have to be made, the
student should familiarise himself with the process we have just illustrated,
and note that a check can always be effected on the arithmetic in the
following way:

Since

$$f(\xi+1) = f\xi + f$$

$$\Sigma\{f(\xi+1)\} = \Sigma(f\xi) + \Sigma(f)$$

$$\Sigma\{f(\xi+1)\} - \Sigma(f\xi) = \Sigma(f) = \text{Total frequency}$$

Hence, if we tabulate the values of $f(\xi+1)$ as well as those of $f\xi$ and find
their totals, the difference must, if the arithmetic is correct, be equal to
the total frequency.

5.13 It will be evident that a classification by unequal intervals is,
at best, a hindrance in the calculation of the mean, and the use of an
indefinite interval at the end of the distribution renders exact calculation
impossible. The following example illustrates the calculation for unequal
class-intervals and the arithmetical check to which we have just referred.

Example 5.2. — Data from Table 4.11, page 89. What is the average 
age at death from scarlet fever ? 

Here there is a change of the class-interval at the five-year point. We
take a year to be the unit, and the centre of the interval 5-10 years as an
arbitrary origin, which means that A = 7.5 years.

Calculation of the arithmetic mean age of persons dying from scarlet fever in the
United Kingdom in 1933 (Table 4.11, p. 89)

Age,       Frequency    Deviation from A
years      $f$          $\xi$                 $f\xi$      $f(\xi+1)$

 0-         16            -7                  -112         -96
 1-         69            -6                  -414        -345
 2-         89            -5                  -445        -356
 3-         74            -4                  -296        -222
 4-         74            -3                  -222        -148
                               (subtotal)    -1489       -1167
 5-        213             0                     0         213
10-         70             5                   350         420
15-         27            10                   270         297
20-         26            15                   390         416
25-         17            20                   340         357
30-         12            25                   300         312
35-         11            30                   330         341
40-         10            35                   350         360
45-          6            40                   240         246
50-          7            45                   315         322
55-          5            50                   250         255
60-          -            55                     -           -
65-          1            60                    60          61
70-          1            65                    65          66
75-          1            70                    70          71

Total      729             -                 +3330       +3737


Hence,

$$\Sigma(f\xi) = 3330 - 1489 = 1841$$

and

$$\Sigma\{f(\xi+1)\} = 3737 - 1167 = 2570$$

and the difference 2570 - 1841 = 729, as it should.

Hence,

$$M - A = \frac{1841}{729} = 2.525 \text{ years}$$

and

$$M = 7.5 + 2.525 = 10.025 \text{ years}$$
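
The same calculation, together with the check of 5.12, can be sketched as follows. The class limits are those of Example 5.2; the closing limit of 80 years for the last class is an assumption, chosen because it reproduces the deviation of 70 shown for that class, and the frequency of 74 for the class 4- is the value implied by the products in the table.

```python
# Lower class limits (years) and frequencies of Example 5.2
lower = [0, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
upper = lower[1:] + [80]            # assumed closing limit of the last class
freq = [16, 69, 89, 74, 74, 213, 70, 27, 26, 17, 12, 11, 10, 6, 7, 5, 0, 1, 1, 1]

A = 7.5                             # arbitrary origin: centre of the class 5-10
xi = [(lo + hi) / 2 - A for lo, hi in zip(lower, upper)]   # deviations in years

N = sum(freq)
sum_f_xi = sum(f * x for f, x in zip(freq, xi))
sum_f_xi_plus_1 = sum(f * (x + 1) for f, x in zip(freq, xi))

# Check of 5.12: the two totals must differ by the total frequency
assert sum_f_xi_plus_1 - sum_f_xi == N

M = A + sum_f_xi / N                # equation (5.4), the unit being one year
print(N, sum_f_xi, sum_f_xi_plus_1, round(M, 3))
```

Because the deviations are measured in years and not in class-intervals, the unequal intervals cause no difficulty beyond the larger multipliers.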




5.14 We return again below, in 5.16 (c), to the question of the errors
caused by the assumption that all values within the same interval may be
treated as approximately the mid-value of the interval. It is sufficient to
say here that the error is in general very small and of uncertain sign for a
distribution of the symmetrical or only moderately asymmetrical type,
provided, of course, the class-interval is not large. In the case of the
J-shaped or extremely asymmetrical distribution, however, the error is
evidently of definite sign, for in all the intervals the frequency is piled up
at the limit lying towards the greatest frequency, i.e. the lower end of the
range in the case of the illustrations given in Chapter 4, and is not evenly
distributed over the interval. In distributions of such a type the intervals
must be made very small indeed to secure an approximately accurate value
for the mean. The student should test for himself the effect of different
groupings in two or three different cases, so as to get some idea of the degree
of inaccuracy to be expected.
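
A test of the kind suggested may be sketched thus. The data are hypothetical: 10,000 evenly spaced quantiles of an exponential law, chosen only because they give a J-shaped distribution with the frequency piled up at the lower end of each interval. The function name `grouped_mean` and the class-widths tried are likewise assumptions.

```python
import math

def grouped_mean(values, width):
    """Mean obtained when each value is replaced by the mid-value of its class."""
    mids = [(math.floor(v / width) + 0.5) * width for v in values]
    return sum(mids) / len(mids)

# Hypothetical J-shaped data: evenly spaced quantiles of an exponential law
n = 10_000
data = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

exact = sum(data) / n               # mean of the ungrouped values
for width in (1.0, 0.5, 0.25):
    approx = grouped_mean(data, width)
    print(f"class-interval {width}: grouped mean {approx:.4f}, "
          f"error {approx - exact:+.4f}")
```

In this instance the grouped mean exceeds the ungrouped mean at every width, and the excess shrinks as the class-interval is narrowed, in agreement with the remark that the error is of definite sign for distributions of this type.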

5.15 If a diagram has been drawn representing the frequency-distribution,
the position of the mean may conveniently be indicated by a vertical
through the corresponding point on the base. In a moderately asym-
metrical distribution the mean lies on the side of the greatest frequency
towards the longer "tail" of the distribution : M in fig. 5.2 shows the
position of the mean in an ideal distribution. In a symmetrical distribu-
tion the mean coincides with the centre of symmetry. The student should
mark the position of the mean in the diagram of every frequency-dis-
tribution that he draws, and so accustom himself to thinking of the mean
not as an abstraction, but always in relation to the frequency-distribution
of the variable concerned.

Fig. 5.2. Mean M, median Mi and mode Mo of the ideal moderately asymmetrical
distribution

Properties of the arithmetic mean

5.16 The following are important properties of the arithmetic mean,
and the examples illustrate the facility of its algebraic treatment:

(a) The sum of the deviations from the mean, taken with their proper
signs, is zero.

This follows at once from equation (5.4) : for if M and A are identical,
evidently $\Sigma(f\xi)$ must be zero.

(b) If a series of N observations of a variable X consist of, say, two
component series, the mean of the whole series can be readily expressed
in terms of the means of the two components. For if we denote the values
in the first series by $X_1$ and in the second series by $X_2$,

$$\Sigma(X) = \Sigma(X_1) + \Sigma(X_2)$$

that is, if there be $N_1$ observations in the first series and $N_2$ in the second,
and the means of the two series be $M_1$, $M_2$ respectively,

$$NM = N_1M_1 + N_2M_2 \tag{5.5}$$

For example, we find from the data of Table 4.7,

Mean stature of the 346 men born in Ireland = 67.78 inches
Mean stature of the 741 men born in Wales   = 66.62 inches

Hence the mean stature of the 1087 men born in the two countries is given
by the equation

$$1087M = (346 \times 67.78) + (741 \times 66.62)$$

that is, M = 66.99 inches.
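
The relation (5.5) amounts to a mean weighted by the numbers of observations, and extends without change to any number of component series. A brief sketch, in which the function name `combined_mean` is illustrative and the figures are those of the equation just given:

```python
def combined_mean(series):
    """Mean of the whole from (number of observations, mean) pairs,
    i.e. N * M = N_1 * M_1 + N_2 * M_2 + ..."""
    total_n = sum(n for n, _ in series)
    return sum(n * m for n, m in series) / total_n

# Men born in Ireland and in Wales
m = combined_mean([(346, 67.78), (741, 66.62)])
print(round(m, 2))
```

The result agrees with the value of 66.99 inches found above.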

It is evident that the form of the relation (5.5) is quite general : if
there are r series of observations $X_1$, $X_2$, . . . $X_r$, the mean M of the
whole series is related to the means $M_1$, $M_2$, . . . $M_r$ of the component
series by the equation

$$NM = \Sigma(N_rM_r) \tag{5.6}$$

For the convenient checking of arithmetic, it is useful to note that, if the
same arbitrary origin A for the deviations $\xi$ be taken in each case, we must
have, denoting the component series by the subscripts 1, 2, ... r as before,

$$\Sigma(f\xi) = \Sigma(f_1\xi_1) + \Sigma(f_2\xi_2) + \ldots + \Sigma(f_r\xi_r) \tag{5.7}$$

The agreement of these totals accordingly checks the work.

As an important corollary to the general relation (5.6), it may be noted
that the approximate value for the mean obtained from any frequency-
distribution is the same whether we assume (1) that all the values in any
class are identical with the mid-value of the class-interval, or (2) that the
mean of the values in the class is identical with the mid-value of the class-
interval.

(c) The mean of all the sums or differences of corresponding observa-
tions in two series (of equal numbers of observations) is equal to the sum
or difference of the means of the two series.

This follows almost at once. For if

$$X = X_1 \pm X_2$$

$$\Sigma(X) = \Sigma(X_1) \pm \Sigma(X_2)$$

That is, if $M$, $M_1$, $M_2$ be the respective means,

$$M = M_1 \pm M_2 \tag{5.8}$$

Evidently the form of this result is again quite general, so that if

$$X = X_1 \pm X_2 \pm \ldots \pm X_r$$

$$M = M_1 \pm M_2 \pm \ldots \pm M_r \tag{5.9}$$

As a useful illustration of equation (5.8), consider the case of measurements
of any kind that are subject (as indeed all measures must be) to greater or
less errors. The actual measurement X in any such case is the algebraic
sum of the true measurement $X_1$ and an error $X_2$. The mean of the actual
measurements M is therefore the sum of the true mean $M_1$ and the
arithmetic mean of the errors $M_2$. If, and only if, the latter be zero, will
the observed mean be identical with the true mean. Errors of grouping
(5.14) are a case in point.
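
The illustration of measurements subject to error can be put in figures. The values below are wholly hypothetical, chosen so that the errors do not average zero; the sketch merely verifies that the mean of the observed values equals the true mean plus the mean error.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)

# Hypothetical true measurements X1 and errors X2, in corresponding order
true_values = [10.0, 12.5, 11.0, 9.5]
errors = [0.25, -0.5, 0.5, 0.25]
observed = [t + e for t, e in zip(true_values, errors)]   # X = X1 + X2

# Equation (5.8): M = M1 + M2
print(mean(observed), mean(true_values) + mean(errors))
```

Here the mean error is 0.125, and the observed mean exceeds the true mean by exactly that amount.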

The Median 

5.17 The median may be defined as the middlemost or central value 
of the variable when the values are ranged in order of magnitude, or as the 
value such that greater and smaller values occur with equal frequency. In 
the case of a frequency-curve, the median may be defined as that value of 
the variable the vertical through which divides the area of the curve into 
two equal parts, as the vertical through Mi in fig. 5.2. 

The median, like the mean, fulfils the conditions (b) and (c) of 5.5, seeing
that it is based on all the observations made, and that it possesses the
simple property of being the central or middlemost value, so that its
nature is obvious.

5.18 But the definition does not necessarily lead in all cases to a deter-
minate value. If there be an odd number of different values of X observed,
say 2n+1, the (n+1)th in order of magnitude is the only value fulfilling
the definition. But if there be an even number, say 2n different values,
any value between the nth and (n+1)th fulfils the conditions. In such
a case it appears to be usual to take the mean of the nth and (n+1)th
values as the median, but this is a convention supplementary to the
definition.
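
The definition, with the supplementary convention for an even number of values, may be sketched as follows. The function is written out in full for clarity; the sample values are hypothetical.

```python
def median(values):
    """Middlemost value of the series; for an even number of values, the
    mean of the two central values (the convention noted in 5.18)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)th of 2n+1 values
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the nth and (n+1)th

print(median([7, 3, 9]))       # odd number of values
print(median([7, 3, 9, 4]))    # even number of values
```

The first call returns the value 7 itself; the second returns 5.5, which is not one of the observed values, showing that the convention adds something to the bare definition.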

5.19 It should also be noted that in the case of a discontinuous variable
the second form of the definition in general breaks down : if we range
the values in order there is always a middlemost value (provided the
number of observations be odd), but there is not, as a rule, any value such
that greater and less values occur with equal frequency. Thus, in Table
4.2 we see that 45 per cent of the poppy capsules had 12 or fewer stigmatic
rays, 55 per cent had 13 or more ; similarly, 61 per cent had 13 or fewer
rays, 39 per cent had 14 or more. There is no number of rays such that
the frequencies in excess and defect are equal. In the case of the butter-
cups of Exercise 4.5 (d), page 100, there is no number of petals that even
remotely fulfils the required condition. An analogous difficulty may arise,
it may be remarked, even in the case of an odd number of observations of a
continuous variable if the number of observations be small and several of
the observed values identical.

The median is therefore a form of average of most uncertain meaning in
cases of strictly discontinuous variation, for it may be exceeded by 5, 10,
15 or 20 per cent only of the observed values, instead of by 50 per cent :
its use in such cases is to be deprecated, and is perhaps best avoided in any
case, whether the variation be continuous or discontinuous, in which small
series of observations have to be dealt with.
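
The difficulty is easily exhibited for the buttercups. The sketch below takes Series A of Exercise 4.5 (d) and finds what share of the observations fall below and above the middlemost value; the function name `median_split` is an assumption, and for an even total the routine simply takes the upper of the two central values.

```python
def median_split(freq_table):
    """Return (middlemost value, share below it, share above it)
    for a table of the form {value: frequency}."""
    n = sum(freq_table.values())
    if n == 0:
        raise ValueError("empty distribution")
    target = (n + 1) / 2           # rank of the middlemost value when n is odd
    running = 0
    for value in sorted(freq_table):
        running += freq_table[value]
        if running >= target:
            med = value
            break
    below = sum(f for v, f in freq_table.items() if v < med) / n
    above = sum(f for v, f in freq_table.items() if v > med) / n
    return med, below, above

# Series A of Exercise 4.5 (d): number of petals -> frequency
series_a = {5: 312, 6: 17, 7: 4, 8: 2, 9: 2}
med, below, above = median_split(series_a)
print(med, round(100 * below, 1), round(100 * above, 1))
```

The middlemost value is 5 petals, yet no observations fall below it and only about 7 per cent exceed it, so that the median here conveys almost nothing about the distribution.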

Determination of the median 

5.20 When all the values of the variate are given and the total frequency 
is small, the median can be determined by inspection as the middlemost 
value or, if there is no such value, as the mean of the two middlemost 
values. When the distribution is given as a frequency-distribution, 
however, a certain amount of approximation is necessary, as in the case 
of the calculation of the mean. 

For the frequency-distribution of a continuous variable a sufficiently 
approximate value of the median can be obtained by interpolation. If 
the total frequency is large it is sufficient to assume that the values in each 
class are uniformly distributed throughout the interval. 

Example 5.3. Let us determine the median of the distribution whose
mean we found in Example 5.1. The work may be indicated thus:

Half the total number of observations (8585)   = 4292.5
Total frequency under 66 15/16 inches          = 3589
Difference                                     =  703.5
Frequency in next interval                     = 1329

Hence we take the median to be

$$66\tfrac{15}{16} + \frac{703.5}{1329} \times 1 = 67.47 \text{ inches}$$

The difference between the median and mean in this case is therefore
only about one-hundredth of an inch.

Example 5.4. To find the median of the distribution of Example 5.2.

Half the total number of observations   = 364.5
Total frequency under 5 years           = 322
Difference                              =  42.5
Frequency in next interval              = 213

Hence we take the median to be

$$5 + \frac{42.5}{213} \times 5 = 6 \text{ years}$$

Here the median is very far from coinciding with the mean.
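
The interpolation used in Examples 5.3 and 5.4 can be written as one short routine. This is a sketch only, resting on the assumption stated in 5.20 that the values in the median class are uniformly distributed over its interval; the function name `grouped_median` and the argument layout are illustrative.

```python
def grouped_median(lower_limits, widths, freq):
    """Median by linear interpolation within the class containing N/2."""
    half = sum(freq) / 2
    cumulative = 0
    for lo, w, f in zip(lower_limits, widths, freq):
        if f > 0 and cumulative + f >= half:
            # Fraction of the class needed to reach half the total frequency
            return lo + (half - cumulative) / f * w
        cumulative += f
    raise ValueError("empty distribution")

# Example 5.3: heights, true lower limits 56 15/16, 57 15/16, ... inches
heights = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
           1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
h_lower = [56 + 15 / 16 + k for k in range(len(heights))]
median_height = grouped_median(h_lower, [1] * len(heights), heights)

# Example 5.4: ages at death from scarlet fever, unequal class-intervals
ages = [16, 69, 89, 74, 74, 213, 70, 27, 26, 17, 12, 11, 10, 6, 7, 5, 0, 1, 1, 1]
a_lower = [0, 1, 2, 3, 4] + list(range(5, 80, 5))
a_width = [1] * 5 + [5] * 15
median_age = grouped_median(a_lower, a_width, ages)

print(round(median_height, 2), round(median_age, 2))
```

The two results reproduce the 67.47 inches and 6 years obtained by hand.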


Graphical determination of the median 

5.21 Graphical interpolation may, if desired, be substituted for arith-
metical interpolation. Taking the figures of Example 5.1, we see that
the number of men with height less than 65 15/16 is 2366, less than 66 15/16
is 3589, less than 67 15/16 is 4918, and less than 68 15/16 is 6148.

Plot the numbers of men with height not exceeding each value of X
to the corresponding value of X on squared paper, to a good large scale,
as in fig. 5.3, and draw a smooth curve through the points thus obtained,
preferably with the aid of one of the "curves," splines or flexible curves
sold by instrument-makers for the purpose. The point at which the
smooth curve so obtained cuts the horizontal line corresponding to a
total frequency N/2 = 4292.5 gives the median. In general the curve is
so flat that the value obtained by this graphical method does not differ
appreciably from that calculated arithmetically (the arithmetical process
assuming that the curve is a straight line between the points on either
side of the median) ; if the curvature is considerable, the graphical value
(assuming, of course, careful and accurate draughtsmanship) is to be
preferred to the arithmetical value, as it does not involve the crude
assumption that the frequency is uniformly distributed over the interval
in which the median lies.

Fig. 5.3. Determination of the median by graphical interpolation.
Horizontal axis: height (inches).

Comparison of the mean and the median 

5.22 If we adopt the convention that the median of an even number
of observations is midway between the two central values, both the
mean and the median satisfy the first three of the desiderata we enumerated
in 5.5 ; that is to say, they are rigidly defined, based on all the observa-
tions, and are readily comprehensible. In the remaining three, however,
they differ considerably.

5.23 As regards ease of calculation, the median has distinct advan- 
tages over the mean. 

Whether the stability of the median under fluctuations of sampling
is greater than that of the mean depends to some extent on the form
of the distribution which is being sampled. In general, the mean is
the more stable, but cases occur in which the median is preferable (cf.
5.24 (d) below, and Chap. 18).

When, however, the ease of algebraical treatment of the two forms
of average is compared, the superiority lies wholly on the side of the mean.
As was shown in 5.16, when several series of observations are combined
into a single series, the mean of the resultant distribution can be simply
expressed in terms of the means of the components. Expression of
the median of the resultant distribution in terms of the medians of the
components is, however, not merely complex and difficult, but usually
impossible: the value of the resultant median depends on the forms of the
component distributions, and not on their medians alone. If two
symmetrical distributions of the same form and with the same numbers of
observations, but with different medians, be combined, the resultant median
must evidently (from symmetry) coincide with the resultant mean, i.e. lie
half-way between the means of the components. But if the two
components be asymmetrical, or (whatever their form) if the degrees of
dispersion or numbers of observations in the two series be different, the
resultant median will not coincide with the resultant mean, nor with
any other simply assignable value. It is impossible, therefore, to give
any theorem for medians analogous to equations (5.5) and (5.6) for
means. It is equally impossible to give any theorem analogous to
equations (5.8) and (5.9) of 5.16. The median of the sum or difference
of pairs of corresponding observations in two series is not, in general,
equal to the sum or difference of the medians of the two series; the
median value of a measurement subject to error is not necessarily identical
with the true median, even if the median error be zero, i.e. if positive
and negative errors be equally frequent.
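A small numerical illustration may make the contrast concrete. In the sketch below all the figures are invented for the purpose: two pairs of component series have the same sizes and the same medians, yet the combined medians differ, whereas the combined mean always follows from the component sizes and means.

```python
from statistics import mean, median


def pooled_mean(groups):
    """Mean of the combined series from the sizes and means of the components."""
    total_n = sum(len(g) for g in groups)
    return sum(len(g) * mean(g) for g in groups) / total_n


# Hypothetical data: in both cases the components have medians 2 and 30.
first_case = [[1, 2, 3], [10, 20, 30, 40, 50]]
second_case = [[0, 2, 100], [26, 27, 30, 31, 32]]

for groups in (first_case, second_case):
    combined = [value for g in groups for value in g]
    print(
        "component medians:", [median(g) for g in groups],
        "| combined median:", median(combined),
        "| combined mean:", mean(combined),
        "| from component means:", pooled_mean(groups),
    )
```

The component medians are identical in the two cases, so no formula using the medians and numbers of observations alone could return both combined medians.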

5.24 These limitations render the applications of the median in any 
work in which theoretical considerations are necessary comparatively 
circumscribed. On the other hand, the median may have an advantage 
over the mean for special reasons. 

(a) It is very readily calculated; a fact to which, however, as already
stated, too much weight ought not to be attached.

(b) It is readily obtained, without the necessity of measuring all the
objects to be observed, in any case in which the objects can be arranged
in order of magnitude. If, for instance, a number of men be ranked in
order of stature, the stature of the middlemost is the median, and he
alone need be measured. (On the other hand, it is useless in the cases
cited at the end of 5.8; the median wage cannot be found from the
total of the wages-bill, and the total of the wages-bill is not known when
the median is given.)

(c) It is sometimes useful as a makeshift, when the observations are
so given that the calculation of the mean is impossible, owing, e.g., to a
final indefinite class.

(d) The median may sometimes be preferable to the mean, owing to
its being less affected by abnormally large or small values of the variable.
The stature of a giant would have no more influence on the median
stature of a number of men than the stature of any other man whose
height is only just greater than the median. If a number of men enjoy
incomes closely clustering round a median of £500 a year, the median
will be no more affected by the addition to the group of a man with an
income of £50,000 than by the addition of a man with an income of £5,000,
or even £600. If observations of any kind are liable to present occasional
greatly outlying values of this sort (whether real, or due to errors or
blunders), the median will be more stable and less affected by fluctuations
of sampling than the arithmetic mean (cf. Chap. 18).

(e) It may be added that the median is, in a certain sense, a particularly
real and natural form of average, for the object or individual that
is the median object or individual on any one system of measuring the
character with which we are concerned will remain the median on any
other method of measurement which leaves the objects in the same relative
order. Thus a batch of eggs representing eggs of the median price,
when prices are reckoned at so much per dozen, will remain a batch
representing the median price when prices are reckoned at so many eggs
to the shilling.
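Points (d) and (e) lend themselves to a quick numerical check. The following is a minimal sketch on invented figures: the incomes and egg prices are hypothetical and chosen only to mirror the illustrations in the text.

```python
from statistics import mean, median

# (d) Hypothetical incomes clustering round a median of 500 pounds a year.
incomes = [480, 490, 495, 500, 505, 510, 520]

medians_after_addition = []
for newcomer in (600, 5_000, 50_000):
    enlarged = incomes + [newcomer]
    medians_after_addition.append(median(enlarged))
    print(newcomer, "median:", median(enlarged), "mean:", round(mean(enlarged), 1))

# (e) Hypothetical egg prices in pence per dozen. A shilling is 12 pence,
# so a price of p pence per dozen is the same as 144 / p eggs to the shilling.
pence_per_dozen = [18, 20, 24, 30, 36]
eggs_per_shilling = [144 / p for p in pence_per_dozen]

median_price = median(pence_per_dozen)
median_eggs = median(eggs_per_shilling)
print(median_price, median_eggs, 144 / median_price)
```

The median is the same whichever newcomer is added, while the mean moves with the size of the outlying income; and the batch at the median price is the same batch under either way of quoting prices.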

The mode 

5.25 The mode is the value of the variable corresponding to the maximum
of the ideal curve which gives the closest possible fit to the actual
distribution. It represents the value which is most frequent or typical,
the value which is, in fact, the fashion (la mode).¹ The mode is sometimes
denoted by writing the sign ^ over the variate symbol, e.g. $\hat{X}$ means
the mode of the values $X_1, X_2, \ldots, X_N$.

There is evidently something anticipatory about this definition, for
we have not yet defined what we mean by "closest possible fit." For
the present the student must content himself with intuitive ideas on this
head. Nor have we given a method of finding the curve of closest fit,
which would be a necessary preliminary to ascertaining the mode.

5.26 It is, in fact, difficult to determine the mode for such distributions
as arise in practice, particularly by elementary methods. It is no use
giving merely the mid-value of the class-interval into which the greatest
frequency falls, for this is entirely dependent on the choice of the scale
of class-intervals. It is no use making the class-intervals very small
to avoid error on that account, for the class-frequencies will then become
small and the distribution irregular. What we want to arrive at is the
mid-value of the interval for which the frequency would be a maximum,
if the intervals could be made indefinitely small, and at the same time
the number of observations be so increased that the class-frequencies
should run smoothly. As the observations cannot, in a practical case,
be indefinitely increased, it is evident that some process of smoothing
out the irregularities that occur in the actual distribution must be adopted,
in order to ascertain the approximate value of the mode. But there is
only one smoothing process that is really satisfactory, in so far as every
observation can be taken into account in the determination, and that
is the method of fitting an ideal frequency-curve of given equation to
the actual figures. The value of the variable corresponding to the
maximum of the fitted curve is then taken as the mode, in accordance
with our definition. The determination of the mode by this, the only
strictly satisfactory, method must, however, be left to the more advanced
student. The methods of curve-fitting which we shall discuss in Chapter 15
are not appropriate to the fitting of frequency-curves, but we give an
approximate method which is of use in certain cases in 25.21.

¹ Unless we state expressly to the contrary, we shall be thinking of single-humped
distributions in talking of "the" mode. When the distribution is of the complicated
form of fig. 4.17 there may be more than one mode. Such distributions are therefore
sometimes called multimodal. The mean and the median are still unique for such
distributions.

Empirical relation between mean, median and mode 
5.27 For a symmetrical distribution, mean, median and mode coincide, 
as will be evident on a little consideration. For other distributions, as 
a rule, they do not. Fig. 5.2 shows the position of the three in a 
moderately skew distribution. 

There is an approximate relation between mean, median and mode 
which appears to hold good with surprising closeness for moderately 
asymmetrical distributions, approaching the ideal type of fig. 4.7, and it 
is one that should be borne in mind as giving — roughly, at all events — 
the relative values of these three averages for a great many cases with 
which the student will have to deal. It is expressed by the equation 

Mode = Mean - 3(Mean - Median)

That is to say, the median lies one-third of the distance mean to mode 
from the mean towards the mode. The student will find it easy to 
remember this relation if he notes that mean, median and mode occur 
in the same order (or the reverse order) as in the dictionary, and that the 
median is nearer to the mean, also as in the dictionary. 

The following table gives the true mode and the mode calculated in 
accordance with the above formula for certain skew distributions of the 
type of fig. 4.10 — 


Comparison of the approximate and true modes in the case of five distributions of the
height of the barometer for daily observations at the stations named

(Distributions given by Karl Pearson and Alice Lee, Phil. Trans., A, 1897, 190, 423)

Station         Mean      Median    Approximate Mode    True Mode
Southampton     29.981    30.000    30.038              30.039
Londonderry     29.891    29.915    29.963              29.960
Carmarthen      29.952    29.974    30.018              30.013
Glasgow         29.886    29.906    29.946              29.967
Dundee          29.870    29.890    29.930              29.951

It will be seen that the true and approximate values are extremely
close, except in the case of Dundee and Glasgow, where the divergence
reaches two-hundredths of an inch.
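The empirical relation is easily put into a form for computation. The sketch below uses only the mean and approximate-mode columns of the table; the function names are illustrative, and the medians it prints are those implied by the relation itself (the median lying one-third of the way from mean to mode), not independent observations.

```python
def approximate_mode(mean, median):
    """Empirical relation: Mode = Mean - 3(Mean - Median)."""
    return mean - 3 * (mean - median)


def implied_median(mean, mode):
    """The same relation solved for the median."""
    return mean - (mean - mode) / 3


# (mean, approximate mode) in inches of mercury, from the table above.
stations = {
    "Southampton": (29.981, 30.038),
    "Londonderry": (29.891, 29.963),
    "Carmarthen": (29.952, 30.018),
    "Glasgow": (29.886, 29.946),
    "Dundee": (29.870, 29.930),
}

for name, (station_mean, mode) in stations.items():
    med = implied_median(station_mean, mode)
    # Round trip: feeding the implied median back returns the mode.
    print(name, round(med, 3), round(approximate_mode(station_mean, med), 3))
```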

5.28 Summing up the preceding paragraphs, we may say that the mean 
is the form of average to use for all general purposes ; it is simply cal- 
culated, its value is nearly always determinate, its algebraic treatment is 
particularly easy, and in most cases it is rather less affected than the 
median by errors of sampling. The median is, it is true, somewhat more 
easily calculated from a given frequency-distribution than is the mean ; 
it is sometimes a useful makeshift, and in a certain class of cases it is 
more and not less stable than the mean ; but its use is undesirable in 
cases of discontinuous variation, its value may be indeterminate, and its 
algebraic treatment is difficult and often impossible. The mode, finally, 
is a form of average hardly suitable for elementary use, owing to the 
difficulty of its determination, but at the same time it represents an 
important value of the variable. The arithmetic mean should invariably 
be employed unless there is some very definite reason for the choice of
another form of average, and the elementary student will do very well 
if he limits himself to its use. Objection is sometimes taken to the use 
of the mean in the case of asymmetrical frequency-distributions, on the 
ground that the mean is not the mode, and that its value is consequently 
misleading. But no one in the least degree familiar with the manifold 
forms taken by frequency-distributions would regard the two as in general 
identical ; and while the importance of the mode is a good reason for 
stating its value in addition to that of the mean, it cannot replace the 
latter. The objection, it may be noted, would apply with almost equal 
force to the median, for, as we have seen (5.27), the difference between 
mode and median is usually about two-thirds of the difference between 
mode and mean. 

The geometric mean 

5.29 The geometric mean $G$ of a series of values $X_1, X_2, \ldots, X_N$
is defined by the relation

$$G = (X_1 X_2 \cdots X_N)^{1/N} \tag{5.10}$$

The definition may also be expressed in terms of logarithms:

$$\log G = \frac{1}{N}\,\Sigma(\log X) \tag{5.11}$$

that is to say, the logarithm of the geometric mean of a series of values
is the arithmetic mean of their logarithms.

The geometric mean of a given series of positive quantities is always less
than their arithmetic mean, unless all the quantities are equal, in which
case the two coincide; the student will find a proof in most textbooks
of algebra. The magnitude of the difference depends largely on the amount
of dispersion of the variable in proportion to the magnitude of the mean
(cf. Exercise 6.12, p. 150). The geometric mean is necessarily zero, it
should be noticed, if even a single value of X is zero, and it may become
imaginary if negative values occur.

Calculation of the geometric mean 

5.30 From equation (5.11) it will be evident that the calculation of
the geometric mean is exactly the same as that of the arithmetic mean 
except that instead of adding the values of the variable we add the 
logarithms of those values. If there are many values we can draw up 
a frequency table for the logarithms and proceed as in Examples 5.1 
and 5.2. 
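The logarithmic route of equation (5.11) can be sketched as follows. The function name and the sample values are illustrative only, and natural logarithms are used where the hand computer would use common logarithms; the base makes no difference to the result.

```python
import math


def geometric_mean(values):
    """Geometric mean via the arithmetic mean of the logarithms, as in (5.11)."""
    if any(v <= 0 for v in values):
        raise ValueError("the logarithmic method needs strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


sample = [1, 2, 4, 8]  # hypothetical values
g = geometric_mean(sample)
arithmetic = sum(sample) / len(sample)
print(round(g, 4), arithmetic)
```

The guard against zero and negative values reflects the remark in 5.29 that the geometric mean is zero if any value is zero and may become imaginary if negative values occur.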

Properties of the geometric mean

5.31 The geometric mean is rigidly defined and takes account of all 
the observations. It is also fairly easily calculated, though not so easily 
as the arithmetic mean. It has, however, no simple and obvious properties 
which render its general nature readily comprehensible. This, coupled 
with its rather abstract mathematical character, has prevented it from 
coming into general use as a representative average. 

5.32 At the same time, as the following examples show, the geometric 
mean possesses some important properties, and is readily treated 
algebraically in certain cases. 

(a) If the series of observations $X$ consist of $r$ component series, there
being $N_1$ observations in the first, $N_2$ in the second, and so on, the
geometric mean $G$ of the whole series can be readily expressed in terms of
the geometric means $G_1$, $G_2$, etc., of the component series. For evidently
we have at once (as in 5.16 (b))

$$N \log G = N_1 \log G_1 + N_2 \log G_2 + \cdots + N_r \log G_r \tag{5.12}$$

(b) The geometric mean of the ratios of corresponding observations
in two series is equal to the ratio of their geometric means. For if

$$X = X_1 / X_2$$

$$\log X = \log X_1 - \log X_2$$

then summing for all pairs of $X_1$'s and $X_2$'s and dividing by $N$, we have
$\log G = \log G_1 - \log G_2$, or

$$G = G_1 / G_2 \tag{5.13}$$

(c) Similarly, if a variable $X$ is given as the product of any number of
others, i.e. if

$$X = X_1 X_2 X_3 \cdots X_r$$

$X_1, X_2, \ldots, X_r$ denoting corresponding observations in $r$ different series,
the geometric mean $G$ of $X$ is expressed in terms of the geometric means
$G_1, G_2, \ldots, G_r$ of $X_1, X_2, \ldots, X_r$ by the relation

$$G = G_1 G_2 G_3 \cdots G_r \tag{5.14}$$

That is to say, the geometric mean of the product is the product of the
geometric means.
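The three properties above can be checked numerically on any positive figures. The sketch below uses two short invented series; the helper name `geometric_mean` is an illustrative choice and the data have no source.

```python
import math


def geometric_mean(values):
    """Geometric mean computed through the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Two hypothetical series of corresponding observations.
x1 = [2.0, 8.0, 4.0, 1.0, 5.0]
x2 = [1.0, 2.0, 4.0, 8.0, 10.0]

# (a) Whole series from its components: N log G = N1 log G1 + N2 log G2.
whole = x1 + x2
log_g_whole = (
    len(x1) * math.log(geometric_mean(x1)) + len(x2) * math.log(geometric_mean(x2))
) / len(whole)

# (b) Ratios and (c) products of corresponding observations.
ratios = [a / b for a, b in zip(x1, x2)]
products = [a * b for a, b in zip(x1, x2)]

print(math.exp(log_g_whole), geometric_mean(whole))
print(geometric_mean(ratios), geometric_mean(x1) / geometric_mean(x2))
print(geometric_mean(products), geometric_mean(x1) * geometric_mean(x2))
```

Each printed pair agrees to within rounding error, as equations (5.12) to (5.14) require.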

5.33 The geometric mean finds applications in several cases where
we have to deal with a quantity whose changes tend to be directly
proportional to the quantity itself, e.g. populations; or where we are dealing
with an average of ratios, as in index-numbers of prices. Suppose,
for instance, we wish to estimate the numbers of a population midway
between two epochs (say two census years) at which the population is
known. If nothing is known concerning the increase of the population
save that the numbers recorded at the first census were $P_0$ and at the
second census $n$ years later $P_n$, the most reasonable assumption to make
is that the percentage increase in each year has been the same, so that
the populations in successive years form a geometric series, $P_0 r$ being
the population a year after the first census, $P_0 r^2$ two years after the first
census, and so on, so that

$$P_n = P_0 r^n \tag{5.15}$$

The population midway between the two censuses is therefore

$$P_{n/2} = P_0 r^{n/2} = (P_0 P_n)^{1/2} \tag{5.16}$$

i.e. the geometric mean of the numbers given by the two censuses. This
result must, however, be used with discretion. The rate of increase of
population is not necessarily, or even usually, constant over any
considerable period of time, particularly where immigration or emigration are
serious factors.
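As a rough illustration of equations (5.15) and (5.16), the following sketch estimates an intercensal population on the assumption of a constant percentage rate of increase. The census figures are invented round numbers and the function names are illustrative.

```python
import math


def midway_population(p_first, p_second):
    """Geometric mean of the two census figures, as in (5.16)."""
    return math.sqrt(p_first * p_second)


def population_after(p_first, p_second, n_years, t_years):
    """Population t years after the first census, assuming a constant
    annual ratio r with p_second = p_first * r**n_years, as in (5.15)."""
    r = (p_second / p_first) ** (1 / n_years)
    return p_first * r ** t_years


# Hypothetical census figures ten years apart.
p0, pn, n = 1_000_000, 1_210_000, 10

print(midway_population(p0, pn))
print(population_after(p0, pn, n, n / 2))
print((p0 + pn) / 2)  # arithmetic mean, for comparison
```

The arithmetic mean of the two census figures is larger than the geometric mean, so it would overstate the mid-period population if growth really were at a constant percentage rate.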

We shall have more to say about the geometric mean in Chapter 25, 
which deals with index-numbers. 

The harmonic mean

5.34 The harmonic mean of a series of quantities is the reciprocal of
the arithmetic mean of their reciprocals; that is, if $H$ be the harmonic
mean,

$$\frac{1}{H} = \frac{1}{N}\,\Sigma\!\left(\frac{1}{X}\right)$$

The following illustration will serve to show the method of calculation.

Example 5.5. The table gives the number of litters of mice, in certain
breeding experiments, with given numbers (X) in the litter. (Data from
A. D. Darbishire, Biometrika, 1903, 3, 30.)

Number in litter    Number of litters    f/X
       X                   f
       1                   7             7.000
       2                  11             5.500
       3                  16             5.333
       4                  17             4.250
       5                  26             5.200
       6                  31             5.167
       7                  11             1.571
       8                   1             0.125
       9                   1             0.111

     Total               121            34.257

Whence

$$\frac{1}{H} = \frac{34.257}{121} = 0.2831, \qquad H = 3.532$$

The arithmetic mean is 4.587, more than a unit greater.
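The arithmetic of Example 5.5 can be reproduced in a few lines. This is only a sketch of the calculation shown in the table; the variable names are illustrative.

```python
# Number in litter X -> number of litters f (Example 5.5).
litters = {1: 7, 2: 11, 3: 16, 4: 17, 5: 26, 6: 31, 7: 11, 8: 1, 9: 1}

n_litters = sum(litters.values())                       # total frequency N
sum_f_over_x = sum(f / x for x, f in litters.items())   # total of the f/X column

harmonic = n_litters / sum_f_over_x                     # H = N / sum(f/X)
arithmetic = sum(x * f for x, f in litters.items()) / n_litters

print(n_litters, round(sum_f_over_x, 3))
print(round(1 / harmonic, 4), round(harmonic, 3), round(arithmetic, 3))
```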

Reciprocal character of arithmetic and harmonic means 
5.35 Prices may be stated in two different ways which are reciprocally
related, the resulting arithmetic mean of the one being the harmonic
mean of the other. Supposing we had 100 returns of retail prices of eggs,
50 returns showing six eggs to the shilling, 30 seven to the shilling, and
20 five to the shilling; then the mean number per shilling would be
6.1, equivalent to a price of 1.967d. per egg. But if the prices had been
quoted in the form usual for other commodities, we should have had 50
returns showing a price of 2d. per egg, 30 showing a price of 1.714d. and
20 a price of 2.4d.; arithmetic mean 1.994d., a slightly greater value
than the harmonic mean of 1.967d.

The harmonic mean of a series of positive quantities is always lower than
the geometric mean of the same quantities (unless all the quantities are
equal, when the two coincide), and a fortiori lower than the
arithmetic mean, the amount of difference depending largely on the
magnitude of the dispersion relatively to the magnitude of the mean (cf.
Exercise 6.13, p. 150).
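The egg-price illustration of 5.35 can be verified directly. The sketch below takes the 100 returns as stated in the text, converts eggs per shilling to pence per egg (12 pence to the shilling), and compares the three means of the prices; the variable names are illustrative.

```python
import math

# Eggs to the shilling -> number of returns (from 5.35).
returns = {6: 50, 7: 30, 5: 20}
n_returns = sum(returns.values())

mean_eggs_per_shilling = sum(e * f for e, f in returns.items()) / n_returns

# Price in pence per egg for each return.
prices = {12 / e: f for e, f in returns.items()}

arithmetic = sum(p * f for p, f in prices.items()) / n_returns
harmonic = n_returns / sum(f / p for p, f in prices.items())
geometric = math.exp(sum(f * math.log(p) for p, f in prices.items()) / n_returns)

print(mean_eggs_per_shilling, round(12 / mean_eggs_per_shilling, 3))
print(round(harmonic, 3), round(geometric, 3), round(arithmetic, 3))
```

The harmonic mean of the prices equals the price implied by the mean number of eggs per shilling, and the three means fall in the order harmonic, geometric, arithmetic.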


SUMMARY 

1. Measures of the location or position of a frequency-distribution are
called averages.

2. There are three forms of average in general use, the mean (arithmetic,
geometric and harmonic), the median and the mode.

3. The arithmetic mean of $N$ values $X_1, X_2, \ldots, X_N$ is given by

$$M = \frac{1}{N}\,\Sigma(X)$$

The geometric mean is given by

$$G = (X_1 X_2 \cdots X_N)^{1/N}$$

or

$$\log G = \frac{1}{N}\,\Sigma(\log X)$$

The harmonic mean is given by

$$\frac{1}{H} = \frac{1}{N}\,\Sigma\!\left(\frac{1}{X}\right)$$

4. The median is the central value of the variable when the values are
ranged in order of magnitude; if the number of values is even, the median
is conventionally taken to be the arithmetic mean of the two central values.

5. The mode is the value of the variate corresponding to the maximum
of the ideal curve which gives the closest possible fit to the actual
distribution.

6. For distributions of moderate skewness there is an empirical
relationship between the mean, the median and the mode expressed by the
equation

Mode = Mean - 3(Mean - Median)


EXERCISES 

5.1 Verify the following means and medians from the data of Table 4.7,
page 82:

              Stature in inches for adult males in
              England    Scotland    Wales    Ireland
Mean          67.31      68.55       66.62    67.78
Median        67.35      68.48       66.56    67.69

In the calculation of the means use the same arbitrary origin as in Example
5.1 and check your work by the method of 5.16 (b).

5.2 The mean of 13 numbers is 10, and the mean of 42 other numbers is
16. Find the mean of the 55 numbers taken together.

5.3 Find the mean weight of adult males in the United Kingdom from
the data in the last column of Exercise 4.6, page 100. Find the median
weight, and hence find the approximate mode by the relation of

5.4 Similarly, find the mean, median and approximate value of the mode
for the distribution of fecundity in race-horses. Table 4.9, page 86.

5.5 Using a graphical method, find the median income subject to sur- or 
super-tax in the financial year 1931 from the data of Table 4.5, page 77. 

5.6 Find the arithmetic mean of the first n natural numbers and show that 
it coincides with the median. 

5.7 (Data from Agricultural Statistics, England and Wales, Part 2, 1932.) 
The figures in columns 1 and 2 of the small table below show the index- 
numbers of prices of certain commodities in the harvest years 1926 and 
1931, the years 1911-13 being taken as 100. In column 3 have been added
the ratios of the index-numbers in 1931 to those in 1926, the latter being 
taken as 100. 

Find the average ratio of prices in 1931 to those in 1926 — 

(1) From the arithmetic mean of the ratios in column 3. 

(2) From the ratio of the arithmetic means of columns 1 and 2. 

(3) From the ratio of the geometric means of columns 1 and 2. 

(4) From the geometric mean of the ratios of column 3. 

Note that, by 5.32, the last two methods must give the same result. 



                     Index-number of price in      Ratios
Commodity            1926          1931            '31/'26
                     (1)           (2)             (3)
1. Wheat             157            79              50.3
2. Fat cattle        131           118              90.1
3. Milk              163           139              85.3
4. Eggs              149           110              73.8
5. Fruit             165           132              80.0
6. Vegetables        135           158             117.0


5.8 Find the arithmetic and geometric means of the series 1, 2, 4, 8, 16,
... $2^n$. Find also the harmonic mean.

5.9 Supposing the frequencies of values 0, 1, 2, ... of a variable to
be given by the terms of the binomial series

$$q^n,\quad n q^{n-1} p,\quad \frac{n(n-1)}{1 \cdot 2}\, q^{n-2} p^2,\quad \ldots$$

where $p + q = 1$, find the mean.

5.10 Show that, in finding the arithmetic mean of a set of readings on a
thermometer, it does not matter whether we measure temperature in
Centigrade or Fahrenheit degrees, but that in finding the geometric mean
it does matter.

5.11 (Data from Census of 1901.) The table below shows the population
of the rural sanitary districts of Essex, the urban sanitary districts (other
than the borough of West Ham), and the borough of West Ham, at the
censuses of 1891 and 1901. Estimate the total population of the county
at a date midway between the two censuses, (1) on the assumption that
the percentage rate of increase was constant for the county as a whole;
(2) on the assumption that the percentage rate of increase was constant
in each group of districts and the borough of West Ham.

                              Population
Essex                         1891          1901
Rural districts               232,867       240,776
West Ham                      204,903       267,358
Other urban districts         345,604       575,864

Total                         783,374     1,083,998


5.12 (Data from Agricultural Statistics, Part 2, 1932.) The following 
statement shows the monthly average prices of eggs in England and Wales 
in 1932, as compiled from returns from certain markets for National Mark 
Specials and English Ordinaries, First Quality, per 120 — 


                                       English Ordinaries,
Month             N.M. Specials        First Quality
                  s.    d.             s.    d.
January           18    11             15     2
February          15     0             12    11
March             11    11             10     0
April             10    10              9     2
May               10     9              8     9
June              12     0             10     0
July              14     2             12     6
August            15     6             13     9
September         18    10             16     3
October           20     9             18     9
November          24     1             21     8
December          21     2             16    10

Mean for year     16     2             13    10


What would have been the mean price for the year in each case if the
wholesale prices had been recorded as retail prices sometimes are, i.e. at
so many eggs per shilling? State your answer in the form of the equivalent
price per 120, and obtain it in the shortest way by taking the harmonic
mean of the above prices.

MEASURES OF DISPERSION


Range 

6.1 We can now turn to a consideration of measures of the dispersion 
of variate values about the central values we have discussed in the last 
chapter. 

The simplest possible measure of dispersion is the range, i.e. the difference
between the greatest and least values observed. The extreme ease with
which this measure may be calculated and its very obvious interpretation
have led to its use in many industrial problems. There are, however,
objections to the use of the range in fields where speed of calculation
and simplicity of interpretation are not of paramount importance.

In fact, the range is subject to fluctuations of considerable magnitude
from sample to sample. There are seldom real upper or lower limits to
the values which a variable can take, large or small values being only
more or less infrequent. The occurrence of one of these infrequent values
may have quite a disproportionate effect on the range. Suppose, for
example, we consider the data of Exercise 4.6, page 100, showing the
frequency-distributions of weights of adult males in several parts of the
United Kingdom. In Wales one individual was observed with a weight
of over 280 lb, the next heaviest being under 260 lb. The addition of
this one exceptional man to 737 others has increased the range by some
30 lb, or about 20 per cent.

Moreover, the range takes no account of the form of the distribution 
within the range. We might get the same value for the range from a 
symmetrical and a J-shaped frequency-curve. Clearly we could not regard 
two such distributions as exhibiting the same dispersion. 
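Both objections to the range can be seen on a toy example. The figures below are invented purely for illustration: two small series of very different shape share the same range, and a single exceptional value enlarges the range considerably.

```python
from statistics import pstdev


def value_range(values):
    """Difference between the greatest and least values observed."""
    return max(values) - min(values)


# Hypothetical series: one roughly symmetrical, one piled up at the lower end.
symmetrical = [1, 4, 5, 5, 5, 6, 9]
j_shaped = [1, 1, 1, 1, 2, 3, 9]

print(value_range(symmetrical), value_range(j_shaped))
print(round(pstdev(symmetrical), 3), round(pstdev(j_shaped), 3))

# One exceptional observation added to the first series.
with_outlier = symmetrical + [15]
print(value_range(with_outlier))
```

Here `pstdev` (the root-mean-square deviation from the mean) is used only as a second yardstick to show that the two series, though equal in range, are not equally dispersed.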

6.2 In modern statistics the range finds its chief use in Quality Control, 
that is to say, the control of the average quality of a manufactured product. 
For instance, when a machine is turning out large numbers of a particular 
component, it is customary to examine a small sample of four or five 
taken at, say, half-hourly intervals to see whether the process is remaining 
constant within limits of error and is not altering by tool-wear or some 
such systematic change. The series of values of mean and range of the 
samples can easily be found by comparatively inexpert operators and 
are often sufficient to enable an adequate check to be kept on the 
process. 


6.3 A measure of dispersion should obey conditions similar to those
we laid down for measures of location in the last chapter (5.5). That
is to say, it should be based on all the observations, should be readily
comprehensible, fairly easily calculated, affected as little as possible by
fluctuations of sampling, and amenable to algebraical treatment.

There are three measures of dispersion in general use, the standard
deviation, the mean deviation and the quartile deviation or semi-interquartile
range. We will consider them in that order.

The standard deviation 

6.4 The standard deviation is the square root of the arithmetic mean
of the squares of all deviations, deviations being measured from the
arithmetic mean of the observations. If the standard deviation be denoted by
$\sigma$, and a deviation from the arithmetic mean by $x$, then the standard
deviation is given by the equation

$$\sigma^2 = \frac{1}{N}\,\Sigma(x^2) \tag{6.1}$$

To square all the deviations may seem at first sight an artificial procedure, 
but it must be remembered that it would be useless to take the mere sum 
of the deviations, in order to obtain a measure of dispersion, since this sum 
is necessarily zero if deviations be taken from the mean. In order to 
obtain some quantity that shall vary with the dispersion, it is necessary to 
average the deviations by a process that treats them as if they were all of 
the same sign, and squaring is the simplest process for eliminating signs 
which leads to results of algebraical convenience. 
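The definition in equation (6.1), and the reason for squaring, can be seen on a short invented series. The data below are hypothetical and the function name is illustrative.

```python
import math


def standard_deviation(values):
    """Square root of the mean of squared deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]   # the x of equation (6.1)
    return math.sqrt(sum(x * x for x in deviations) / n)


observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
mean = sum(observations) / len(observations)

# The plain sum of deviations from the mean is zero, so it cannot measure dispersion.
sum_of_deviations = sum(v - mean for v in observations)
sigma = standard_deviation(observations)
print(mean, sum_of_deviations, sigma)
```

Note that the divisor is the total frequency N, in accordance with the definition given in the text.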

Root-mean-square deviation 

6.5 The standard deviation is a particular case of a more general quantity
known as the root-mean-square deviation, which has theoretical
importance.

Let $A$ be any arbitrary value of $X$, and let $\xi$ (as in 5.11) denote the
deviation of $X$ from $A$; i.e. let

$$\xi = X - A$$

Then we may define the root-mean-square deviation $s$ from the origin $A$
by the equation

$$s^2 = \frac{1}{N}\,\Sigma(\xi^2) \tag{6.2}$$

The standard deviation is the value of the root-mean-square deviation
taken from the mean.

6.6 The quantities $\sigma^2$ and $s^2$, i.e. the squares of the standard and
root-mean-square deviations, are sufficiently important in much theoretical
work to have special names.

The square of the standard deviation, $\sigma^2$, is called the variance.

The quantity $\frac{1}{N}\Sigma(\xi^2)$, i.e. $s^2$, is called the second moment
about the value $A$. We have already seen (5.11) that the quantity
$\frac{1}{N}\Sigma(\xi)$ is called the first moment about $A$, and in the next
chapter we shall consider moments of higher orders.

Thus, the variance is the second moment about the mean.


Relation between standard and root-mean-square deviations

6.7 There is a very simple relation between the standard deviation
and the root-mean-square deviation from any other origin. Let

$$M - A = d \tag{6.3}$$

so that

$$\xi = x + d$$

Then

$$\xi^2 = x^2 + 2xd + d^2$$

$$\Sigma(\xi^2) = \Sigma(x^2) + 2d\,\Sigma(x) + N d^2$$

But the sum of the deviations from the mean is zero, therefore the second
term vanishes, and accordingly, dividing by $N$,

$$s^2 = \sigma^2 + d^2 \tag{6.4}$$

Hence the root-mean-square deviation is least when deviations are
measured from the mean, i.e. the standard deviation is the least possible
root-mean-square deviation.
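Equation (6.4) can be confirmed numerically for any choice of origin. The sketch below uses an invented series and a few arbitrary origins A; the function name is illustrative.

```python
import math


def mean_square_deviation(values, origin):
    """s^2: mean of the squared deviations of the values from the origin A."""
    return sum((v - origin) ** 2 for v in values) / len(values)


observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
m = sum(observations) / len(observations)          # arithmetic mean M
variance = mean_square_deviation(observations, m)  # sigma^2, origin at the mean

for a in (3, 5, 6.5):
    d = m - a                                      # equation (6.3)
    s_squared = mean_square_deviation(observations, a)
    # Equation (6.4): s^2 = sigma^2 + d^2, so s^2 is never less than sigma^2.
    print(a, s_squared, variance + d * d)
```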

6.8 If $\sigma$ and $d$ are the two sides of a right-angled triangle, $s$ is the
hypotenuse. If, then, MH be the vertical through the mean of a frequency
distribution (fig. 6.1), and MS be set off equal to the standard deviation
(on the same scale by which the variable X is plotted along the base),
SA will be the root-mean-square deviation from the point A. This
construction gives a concrete idea of the way in which the root-mean-square
deviation depends on the origin from which deviations are
measured. It will be seen that for small values of $d$ the difference of
$s$ and $\sigma$ will be very minute, since A will lie very nearly on the circle
drawn through M with centre S and radius SM; slight errors in the
mean due to approximations in calculation will not, therefore, appreciably
affect the value of the standard deviation.

Fig. 6.1

Calculation of the standard deviation

6.9 If we have to deal with relatively few, say thirty or forty, ungrouped
observations, the method of calculating the standard deviation is perfectly
straightforward. It is illustrated by the figures below giving the minimum
wage-rates for agricultural labourers in England and Wales at the
beginning of 1936.

First of all the mean is ascertained. Then we find the values of $x$ by
subtracting the mean from all values of the variable. Each difference is
squared and the total, $\Sigma(x^2)$, obtained. This total divided by the total
frequency is the square of the standard deviation.

In practice, we can simplify the arithmetic by working from an arbitrary
value $A$ instead of from the mean. Such a value is usually known as the
"working mean." When we have found the mean-square deviation $s^2$
about $A$ we can easily find the value of $\sigma^2$ from equation (6.4).

Example 6.1. Calculation of Standard Deviation for a short series of
observations (49) ungrouped. Minimum weekly rates of wages for
ordinary adult male agricultural workers in England and Wales as at
1st January 1936.

By inspection of the table below we see that the mean is in the
neighbourhood of 32 shillings. We therefore take this as the working mean $A$.
The column headed "Difference" is the excess of the value of the variable
over this value. The column headed "(Difference)²" is the square of
the excess. We find

$$\frac{1}{N}\,\Sigma(\xi) = \frac{-79}{49} = -1.612 \text{ pence}$$

Hence the means 32 shillings— 1*612 pence 

ss 31 shillings 10*4 pence approximately. 
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Area 


Wage rates 

DifEerence 
g (pence) 

(Difie^ce)* 



s. 

d. 



Bedford and Huntingdon shires 


31 

6 

- 6 

36 

Berkshire .... 


31 

0 

-12 

144 

Bucks .... 


32 

0 

— 

— 

Cambridgeshire 


31 

6 

- 6 

36 

Cheshire .... 


32 

6 

6 

36 

Cornwall .... 


32 

0 

— 

— 

Cumberland .... 


32 

6 

6 

36 

Derbyshire .... 


36 

0 

48 

2,304 

Dorset .... 


31 

6 

- 6 

36 

Durham .... 


29 

0 

-36 

1,296 

Essex .... 


31 

0 

-12 

144 

Gloucester .... 


31 

0 

-12 

144 

Hampshire .... 


31 

0 

-12 

144 

Hereford .... 


31 

0 

-12 

144 

Hertford .... 


32 

0 

— 

— 

Kent .... 


33 

0 

12 

144 

Lancashire (South) 


32 

9 

9 

81 

,, (Rest) . 


36 

6 

54 

2,916 

I^eicester .... 


33 

0 

12 

144 

Lines (Holland) 


34 

0 

24 

576 

,, (Kesteven and Lindsey). 


31 

0 

-12 

144 

Middlesex . . * . 


33 

8 

20 

400 

Monmouth .... 


32 

0 

— 

— 

Norfolk .... 


31 

6 

- 6 

36 

Northants .... 


31 

6 

- 6 

36 

Northumberland 


31 

6 

— 6 

36 

Notts .... 


32 

0 

— 

— 

Oxfordshire .... 


31 

6 

- 6 

36 

Rutland .... 


31 

6 

- 6 

36 

Shropshire .... 


32 

0 

— 

— 

Somerset .... 


32 

6 

6 

36 

Staffs .... 


31 

6 

- 6 

36 

Suffolk .... 


31 

0 

-12 

144 

Surrey .... 


32 

3 

3 

9 

Sussex .... 


32 

0 

— 

— 

Warwickshire 


30 

0 

-24 

576 

Westmorland 


31 

0 

-12 

144 

Wiltshire .... 


31 

0 

-12 

144 

Worcester .... 


31 

0 

-12 

144 

Yorks, E. Riding 


33 

6 

18 

324 

,, N. Riding 


33 

0 

12 

144 

.. W. Riding 


33 

9 

21 

441 

Anglesey and Caernarvon 


31 

0 

-12 

144 

jCarmarthcn 


31 

6 

- 6 

36 

Denbigh and Flint . 


30 

6 

-18 

324 

Glamorgan 


33 

6 

18 

324 

Merioneth and Montgomery . 


28 

6 

-42 

1,764 

Pembroke and Cardigan 


31 

0 

-12 

144 

Riidnor and Brecon 


30 

0 

-24 

576 


-79 14,539 


Totals 
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Also 

^I(g*)Ji^?®=296-714=s* 

a*=s*-rf»=296-714-{l-612)* 

=294-112 

0=17-15 pence approximately. | 
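The working-mean calculation of Example 6.1 can be reproduced in a few lines. The following is a minimal sketch, assuming the 49 differences of the table are stored as a frequency dictionary; the names `DEVIATIONS` and `mean_and_sd_from_working_mean` are illustrative and not part of the text.

```python
from math import sqrt

# Differences in pence from the working mean A = 32s. (384d.), stored as
# {difference: number of areas}. Transcribed from the table of Example 6.1.
DEVIATIONS = {-42: 1, -36: 1, -24: 2, -18: 1, -12: 12, -6: 10, 0: 7, 3: 1,
              6: 3, 9: 1, 12: 3, 18: 2, 20: 1, 21: 1, 24: 1, 48: 1, 54: 1}


def mean_and_sd_from_working_mean(deviations, working_mean):
    """Return (mean, standard deviation) using sigma^2 = s^2 - d^2."""
    n = sum(deviations.values())
    d = sum(xi * f for xi, f in deviations.items()) / n        # M - A
    s2 = sum(xi * xi * f for xi, f in deviations.items()) / n  # mean square about A
    sigma2 = s2 - d * d                                        # variance about the mean
    return working_mean + d, sqrt(sigma2)


mean_pence, sd_pence = mean_and_sd_from_working_mean(DEVIATIONS, 384)
shillings, pence = divmod(mean_pence, 12)
print(f"mean = {int(shillings)}s. {pence:.1f}d., s.d. = {sd_pence:.2f}d.")
```

Any other working mean would give the same final mean and standard deviation; the choice of 32s. only keeps the differences small.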

We would direct the student's attention to the necessity for checking
his work at each stage before proceeding to the next. If he neglects this
warning he is likely to learn by bitter experience how essential it was.
For instance, in the above work it would be well to check the value of
the mean by summing the wage rates and dividing by 49. We get in
this way:

$$\text{Mean} = \frac{1561\text{s. } 5\text{d.}}{49} = 31\text{s. } 10.4\text{d.}$$

which checks with the mean found from the working mean. Secondly,
the squares of differences should be checked before they are added, and
if the addition is made without a machine, a check should be carried out
by summing first from bottom to top and then from top to bottom, to
avoid repeating errors. A further systematic check is given in 6.11 below.

6.10 If we have to deal with a grouped frequency-distribution the
same artifices and approximations are used as in the calculation of the
mean (5.10 and 5.11). The mid-value of one of the class-intervals is
chosen as the arbitrary origin A from which to measure the deviations ξ,
the class-interval is treated as a unit throughout the arithmetic, and all
the observations within any one class-interval are treated as if they were
identical with the mid-value of the interval. If, as before, we denote the
frequency in any one interval by f, these f observations contribute $f\xi^2$ to
the sum of the squares of deviations, and we have:

$$s^2 = \frac{1}{N}\Sigma(f\xi^2)$$

The standard deviation is then calculated from equation (6.4).

6.11 As the arithmetic in calculating the standard deviation is often
extensive, it is as well to use some check similar to that of 5.12. In
this case we have:

$$(\xi+1)^2 = \xi^2 + 2\xi + 1$$
$$f(\xi+1)^2 = f\xi^2 + 2f\xi + f$$
$$\therefore\ \Sigma\{f(\xi+1)^2\} = \Sigma(f\xi^2) + 2\Sigma(f\xi) + N$$

Hence, if we calculate $\Sigma\{f(\xi+1)^2\}$ as well as $\Sigma(f\xi^2)$, the above equation
gives us a simple check on the accuracy of our work. The following
examples illustrate the method:

Example 6.2. Calculation of the standard deviation of stature of male
adults in the British Isles from the figures of Table 4.7, page 82.

  (1)        (2)         (3)          (4)        (5)        (6)        (7)
Height    Frequency   Deviation     Product               Product
inches        f       from value      fξ       f(ξ+1)       fξ²      f(ξ+1)²
                        A, ξ
57-            2        -10           -20        -18        200        162
58-            4         -9           -36        -32        324        256
59-           14         -8          -112        -98        896        686
60-           41         -7          -287       -246      2,009      1,476
61-           83         -6          -498       -415      2,988      2,075
62-          169         -5          -845       -676      4,225      2,704
63-          394         -4        -1,576     -1,182      6,304      3,546
64-          669         -3        -2,007     -1,338      6,021      2,676
65-          990         -2        -1,980       -990      3,960        990
66-        1,223         -1        -1,223          0      1,223          0
67-        1,329          0             0      1,329          0      1,329
68-        1,230         +1         1,230      2,460      1,230      4,920
69-        1,063         +2         2,126      3,189      4,252      9,567
70-          646         +3         1,938      2,584      5,814     10,336
71-          392         +4         1,568      1,960      6,272      9,800
72-          202         +5         1,010      1,212      5,050      7,272
73-           79         +6           474        553      2,844      3,871
74-           32         +7           224        256      1,568      2,048
75-           16         +8           128        144      1,024      1,296
76-            5         +9            45         50        405        500
77-            2        +10            20         22        200        242

Total (negative terms)             -8,584     -4,995
Total (positive terms)  8,585       8,763     13,759     56,809     65,752

$$\Sigma(f\xi) = 8{,}763 - 8{,}584 = 179$$
$$\Sigma\{f(\xi+1)\} = 13{,}759 - 4{,}995 = 8{,}764$$

This is an example we have already considered when calculating the 
mean, and the work of the first four columns is the same as that of Example 
5.1, page 107. 

As a check on $\Sigma(f\xi)$ we have:

$$\Sigma\{f(\xi+1)\} - \Sigma(f\xi) = 8{,}764 - 179 = 8{,}585 = N$$

As a check on $\Sigma(f\xi^2)$ we have:

$$\Sigma\{f(\xi+1)^2\} - \Sigma(f\xi^2) - 2\Sigma(f\xi) = 65{,}752 - 56{,}809 - 358 = 8{,}585 = N$$

From previous work, $M - A = d = +0.0209$ class-intervals or inches.

$$s^2 = \frac{\Sigma(f\xi^2)}{N} = \frac{56{,}809}{8{,}585} = 6.6172$$

$$\sigma^2 = 6.6172 - (0.0209)^2 = 6.6168$$

$$\sigma = 2.57 \text{ class-intervals or inches.}$$
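The grouped calculation of Example 6.2, together with the checks of 6.11, can be sketched in Python. The frequencies below are those of the table of Example 6.2 (entries that are illegible in the table are inferred here from its product columns and totals); the function and variable names are illustrative only.

```python
from math import sqrt

# Frequencies for the intervals 57-, 58-, ..., 77- inches (Example 6.2).
FREQ = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
        1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
# Deviations in class-intervals from the origin A (the interval 67-).
XI = list(range(-10, 11))


def grouped_sd(freq, xi):
    """Return (N, d, sigma) in class-interval units, with the 6.11 checks."""
    n = sum(freq)
    s_fx = sum(f * x for f, x in zip(freq, xi))
    s_fx2 = sum(f * x * x for f, x in zip(freq, xi))
    s_fx1 = sum(f * (x + 1) for f, x in zip(freq, xi))
    s_fx1_sq = sum(f * (x + 1) ** 2 for f, x in zip(freq, xi))
    # Checks: both differences must reproduce the total frequency N.
    if s_fx1 - s_fx != n or s_fx1_sq - s_fx2 - 2 * s_fx != n:
        raise ArithmeticError("check of 6.11 failed")
    d = s_fx / n                 # M - A
    sigma2 = s_fx2 / n - d * d   # equation (6.4): sigma^2 = s^2 - d^2
    return n, d, sqrt(sigma2)


n, d, sigma = grouped_sd(FREQ, XI)
print(n, round(d, 4), round(sigma, 2))
```

On a machine the check is of little value, since the same data feed both sums, but it mirrors the hand procedure, in which the two columns are worked independently.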


Example 6.3. — Let us find the mean and standard deviation of the 
distribution of Australian marriages given in Table 4.8, page 84. 

Calculation of standard deviation of age of bridegroom in a distribution 
of Australian marriages. 


Age of
bridegroom     Frequency
(central value)    f        ξ        fξ        f(ξ+1)         fξ²       f(ξ+1)²
Years

16.5              294      -4     -1,176        -882        4,704        2,646
19.5           10,995      -3    -32,985     -21,990       98,955       43,980
22.5           61,001      -2   -122,002     -61,001      244,004       61,001
25.5           73,054      -1    -73,054           0       73,054            0
28.5           56,501       0          0      56,501            0       56,501
31.5           33,478       1     33,478      66,956       33,478      133,912
34.5           20,569       2     41,138      61,707       82,276      185,121
37.5           14,281       3     42,843      57,124      128,529      228,496
40.5            9,320       4     37,280      46,600      149,120      233,000
43.5            6,236       5     31,180      37,416      155,900      224,496
46.5            4,770       6     28,620      33,390      171,720      233,730
49.5            3,620       7     25,340      28,960      177,380      231,680
52.5            2,190       8     17,520      19,710      140,160      177,390
55.5            1,655       9     14,895      16,550      134,055      165,500
58.5            1,100      10     11,000      12,100      110,000      133,100
61.5              810      11      8,910       9,720       98,010      116,640
64.5              649      12      7,788       8,437       93,456      109,681
67.5              487      13      6,331       6,818       82,303       95,452
70.5              326      14      4,564       4,890       63,896       73,350
73.5              211      15      3,165       3,376       47,475       54,016
76.5              119      16      1,904       2,023       30,464       34,391
79.5               73      17      1,241       1,314       21,097       23,652
82.5               27      18        486         513        8,748        9,747
85.5               14      19        266         280        5,054        5,600
88.5                5      20        100         105        2,000        2,205

Total         301,785              88,832     390,617    2,155,838    2,635,287

We take a working mean A = 28.5.

As a check on $\Sigma(f\xi)$ we have:

$$\Sigma\{f(\xi+1)\} - \Sigma(f\xi) = 390{,}617 - 88{,}832 = 301{,}785 = N$$

As a check on $\Sigma(f\xi^2)$ we have:

$$\Sigma\{f(\xi+1)^2\} - \Sigma(f\xi^2) - 2\Sigma(f\xi) = 2{,}635{,}287 - 2{,}155{,}838 - 177{,}664 = 301{,}785 = N$$

Then

$$M - A = d = \frac{88{,}832}{301{,}785} = 0.29436 \text{ interval} = 0.88308 \text{ year}$$

Hence, M = 29.383 years.

We have:

$$s^2 = \frac{2{,}155{,}838}{301{,}785} = 7.143622 \text{ intervals}^2$$

$$\sigma^2 = s^2 - d^2 = 7.056974 \text{ intervals}^2$$

$$\sigma = 2.6565 \text{ intervals} = 7.969, \text{ or 8 years approximately.}$$


Sheppard’s correction for grouping 

6.12 The student must remember that the treatment of all the values 
of a variable in a class-interval as if they were concentrated at the centre 
of that interval is an approximation, although, for distributions of sym- 
metrical or moderately skew type and class-intervals not greater than 
about one-twentieth of the range, the approximation may be a very 
close one. 

It has been shown that if

(a) the distribution of frequency is continuous, and

(b) the frequency tapers off to zero in both directions,

the variance obtained from grouped data may with advantage be corrected
for the grouping effect by subtracting from it one-twelfth of the square
of the class-interval; i.e. if the class-interval be h units in width, $\sigma^2$ the
corrected value of the variance and $\sigma_1^2$ the value obtained from the
grouped data:

$$\sigma^2 = \sigma_1^2 - \frac{h^2}{12} \qquad (6.5)$$

The proof of this formula lies outside the scope of this book. We may
emphasise condition (b). The Sheppard correction is not applicable to
J- or U-shaped distributions, or even to the skew form of fig. 4.7 (4),
page 84.

Furthermore, unless the total frequency is fairly large, the Sheppard
correction is likely to be of secondary importance compared with fluctua-
tions of sampling (see 19.13). We suggest that, as a general rule, the
correction should not be made unless the frequency is at least 1,000
or the grouping coarser than that given by intervals of about one-twentieth
of the range. We give in Exercise 6.15 a result which will convey the
general magnitude of the correction for the finer grouping.
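As a small illustration, Sheppard's correction (subtracting one-twelfth of the square of the class-interval from the grouped variance) can be written as a one-line function. The sketch below applies it to the uncorrected variances quoted in Examples 6.4 and 6.5; the function name is hypothetical, and the check on the two conditions (a) and (b) is left to the user.

```python
from math import sqrt


def sheppard_corrected_variance(grouped_variance, interval_width=1.0):
    """Subtract h^2/12 from a variance computed from grouped data."""
    return grouped_variance - interval_width ** 2 / 12


# Example 6.4: heights, variance in inches^2, class-interval h = 1 inch.
var_height = sheppard_corrected_variance(6.6168, 1.0)
sd_height = sqrt(var_height)

# Example 6.5: ages, variance in (class-intervals)^2, so h = 1 in those units;
# each class-interval is 3 years wide.
var_age = sheppard_corrected_variance(7.056974, 1.0)
sd_age_years = 3 * sqrt(var_age)

print(round(var_height, 4), round(sd_height, 2))
print(round(var_age, 6), round(sd_age_years, 3))
```

Working in class-interval units keeps h = 1, so the correction is always 1/12 whatever the width of the interval in the original units.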

Example 6.4. In Example 6.2 we have:

$$\sigma_1^2 = 6.6168$$

Here $h^2 = 1$, and $h^2/12 = 0.0833$

$$\text{corrected value of } \sigma^2 = \sigma_1^2 - h^2/12 = 6.6168 - 0.0833 = 6.5335$$

and σ corrected = 2.56, differing from the uncorrected value by 0.01.

Example 6.5. In Example 6.3 we have:

$$\sigma^2 \text{ (uncorrected)} = 7.056974 \text{ intervals}^2$$

Here $\sigma^2$ is expressed in terms of $h^2$, and hence to correct it we subtract
1/12, giving

$$\sigma^2 \text{ (corrected)} = 6.973641$$

$$\sigma = 2.6408 \text{ intervals} = 7.922 \text{ years}$$

as against an uncorrected value of 7.969 years.

Spread of observations and standard deviation

6.13 It is a useful empirical rule to remember that a range of six
times the standard deviation usually includes 99 per cent or more of all
the observations in the case of distributions of the symmetrical or moder-
ately asymmetrical type. Thus in Example 6.2 the standard deviation
is 2.57 in., six times this is 15.42 in., and a range from, say, 60 in. to
75.4 in. includes all but some 36 out of 8,585 individuals, i.e. about
99.6 per cent. This rough rule serves to give a more definite and concrete
meaning to the standard deviation, and also to check arithmetical work
to some extent: sufficiently, that is to say, to guard against very gross
blunders. It must not be expected to hold for short series of observations;
in Example 6.1, for instance, the actual range is a good deal less than
six times the standard deviation.

Properties of the standard deviation 

6.14 The standard deviation is the measure of dispersion which it is 
most easy to treat by algebraical methods, resembling in this respect 
the arithmetic mean amongst measures of position. The majority of 
illustrations of its treatment must be postponed to a later stage, but 
the work of 6.9 has already served as one example. We showed in 5.16
that if a series of observations of which the mean is M consists of two
component series, of which the means are $M_1$ and $M_2$ respectively,

$$N M = N_1 M_1 + N_2 M_2$$

$N_1$ and $N_2$ being the numbers of observations in the two component
series, and $N = N_1 + N_2$ the number in the entire series. Similarly, the
standard deviation σ of the whole series may be expressed in terms of
the standard deviations $\sigma_1$ and $\sigma_2$ of the components and their respective
means. Let

$$M_1 - M = d_1, \qquad M_2 - M = d_2$$

Then the mean-square deviations of the component series about the mean
M are, by equation (6.4), $\sigma_1^2 + d_1^2$ and $\sigma_2^2 + d_2^2$ respectively. Therefore,
for the whole series

$$N\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2) \qquad (6.6)$$

If the numbers of observations in the component series be equal and the
means be coincident, we have as a special case:

$$\sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) \qquad (6.7)$$

so that in this case the variance (6.6) of the whole series is the arithmetic
mean of the variances of its components.

It is evident that the form of the relation (6.6) is quite general; if a
series of observations consists of r component series with standard devia-
tions $\sigma_1, \sigma_2, \ldots \sigma_r$, and means diverging from the general mean of
the whole series by $d_1, d_2, \ldots d_r$, the standard deviation σ of the whole
series is given (using m to denote any subscript) by the equation

$$N\sigma^2 = \Sigma(N_m\sigma_m^2) + \Sigma(N_m d_m^2) \qquad (6.8)$$

Again, as in 5.16, it is convenient to note, for the checking of arithmetic,
that if the same arbitrary origin be used for the calculation of the standard
deviations in a number of component distributions, we must have:

$$\Sigma(f\xi^2) = \Sigma(f_1\xi^2) + \Sigma(f_2\xi^2) + \ldots + \Sigma(f_r\xi^2) \qquad (6.9)$$
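The way the standard deviation of a whole series is built up from its components can be checked numerically. The sketch below assumes the relation takes the form N·σ² = Σ N_m(σ_m² + d_m²), where d_m is the divergence of the m-th component mean from the general mean; the two short series are invented purely for illustration.

```python
from math import sqrt
from statistics import fmean, pvariance


def combined_mean_and_sd(components):
    """components: iterable of (N_m, mean_m, sd_m) for each component series."""
    components = list(components)
    n = sum(n_m for n_m, _, _ in components)
    mean = sum(n_m * m_m for n_m, m_m, _ in components) / n
    # Each component contributes N_m * (sigma_m^2 + d_m^2) to N * sigma^2.
    total = sum(n_m * (s_m ** 2 + (m_m - mean) ** 2)
                for n_m, m_m, s_m in components)
    return mean, sqrt(total / n)


# Two hypothetical component series (not taken from the text).
series_a = [2.0, 4.0, 4.0, 6.0]
series_b = [10.0, 12.0, 17.0]
parts = [(len(s), fmean(s), sqrt(pvariance(s))) for s in (series_a, series_b)]

mean, sd = combined_mean_and_sd(parts)
direct_sd = sqrt(pvariance(series_a + series_b))
print(round(mean, 4), round(sd, 4), round(direct_sd, 4))
```

The combined figure agrees with the standard deviation computed directly from the pooled observations, which is the practical use of the relation: components can be summarised separately and merged later.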

6.15 As another useful illustration, let us find the standard deviation
of the first N natural numbers. The mean in this case is evidently
(N+1)/2. Further, as is shown in any elementary algebra, the sum of
the squares of the first N natural numbers is

$$\frac{N(N+1)(2N+1)}{6}$$

Applying equation (6.4) we have that the standard deviation σ is given
by

$$\sigma^2 = \tfrac{1}{6}(N+1)(2N+1) - \tfrac{1}{4}(N+1)^2$$

that is,

$$\sigma^2 = \tfrac{1}{12}(N^2 - 1) \qquad (6.10)$$

This result is of service if the relative merit of, or the relative intensity 
of some character in, the different individuals of a series is recorded not 
by means of measurements, e.g. marks awarded on some system of 
examination, but merely by means of the respective positions when 
ranked in order as regards the character, in the same way as boys are 
numbered in a class. With N individuals there are always N ranks, as 
they are termed, whatever the character, and the standard deviation is 
therefore always that given by equation (6.10). 
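The result for ranks is easily verified by direct computation. A minimal sketch, with an illustrative function name, compares the variance of the numbers 1 to N (divisor N, as in the text) with (N² − 1)/12:

```python
from statistics import pvariance


def variance_of_first_n(n):
    """Variance of the first n natural numbers: (n^2 - 1) / 12."""
    return (n * n - 1) / 12


for n in (2, 5, 49, 100):
    direct = pvariance(list(range(1, n + 1)))  # population variance, divisor n
    assert abs(direct - variance_of_first_n(n)) < 1e-9

print(variance_of_first_n(49))
```

For 49 ranked individuals, for instance, the variance of the ranks is 200 whatever character was used to rank them.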

Another useful result follows at once from equation (6.10), namely, the
standard deviation of a frequency-distribution in which all values of X
within a range ±l/2 on either side of the mean are equally frequent,
values outside these limits not occurring, so that the frequency-distribution
may be represented by a rectangle. The base l may be supposed divided
into a very large number N of equal elements, and the standard deviation
reduces to that of the first N natural numbers when N is made indefinitely
large. The single unit then becomes negligible compared with N, and
consequently

$$\sigma^2 = \frac{l^2}{12} \qquad (6.11)$$


6.16 It will be seen from the preceding paragraphs that the standard
deviation possesses the majority at least of the properties which are
desirable in a measure of dispersion as in an average (5.5). It is rigidly
defined; it is based on all the observations made; it is calculated with
reasonable ease; it lends itself readily to algebraical treatment; and we
may add, though the student will have to take the statement on trust
for the present, that it is, as a rule, the measure least affected by fluctua-
tions of sampling. On the other hand, it may be said that its general
nature is not very readily comprehended, and that the process of squaring
deviations and then taking the square root of the mean seems a little
involved. The student will, however, soon surmount this feeling after a
little practice in the calculation and use of the constant, and will realise,
as he advances further, the advantages that it possesses. Such root-
mean-square quantities, it may be added, frequently occur in other
branches of science. The standard deviation should always be used as
the measure of dispersion, unless there is some very definite reason for
preferring another measure, just as the arithmetic mean should be used
as the measure of position.

Note on nomenclature 

6.17 A great deal of confusion has been introduced into statistical
literature by the many different expressions which have been used for
the standard deviation and simple derivatives of it. It used to be almost
a case of tot homines quot nomina, and as the student may meet these
expressions elsewhere, we give a short list of them. The term "standard
deviation" is now almost universally accepted, and in this book we shall
use no other.

"Mean error" (Gauss), "mean square error" and "error of mean
square" (Airy) have all been used to denote the standard deviation.

The standard deviation is not to be confused with the "standard
error." We shall use this term in a special sense, that of the standard
deviation of simple sampling (cf. 17.8).

The standard deviation multiplied by the square root of 2 is also known
as "the modulus." The student will see the reason for this multiplication
later. The reciprocal of the modulus is called the "precision."

There is also a quantity known as the "probable error," which is
defined as being 0.67449 times the standard deviation (cf. 17.9). These
last four quantities are particularly important in the theory of errors of
observation and the theory of sampling.

Finally, we may remark that since we shall use the expression
"standard deviation" very frequently, we shall sometimes use the
abbreviation "s.d." or simply the symbol σ.

Mean deviation 

6.18 We have already remarked that it would be useless to take the
sum of deviations from the mean as a measure of dispersion because such
a sum is identically zero. We therefore remove the signs of the deviations
by squaring to reach the standard deviation.

It is also possible to overcome this difficulty by summing the
deviations taken regardless of sign. The arithmetic mean of these
"absolute" deviations is called the mean deviation.

If we write |ξ| to denote the deviation from an arbitrary value A taken
as positive whatever its actual sign, the mean deviation is thus defined as

$$\text{m.d.} = \frac{1}{N}\Sigma(|\xi|) \qquad (6.12)$$

(The expression |ξ| is read "mod ξ", an abbreviation for "the modulus
of ξ".)

6.19 Just as the root-mean-square deviation is least when deviations
are measured from the arithmetic mean, so the mean deviation is least
when deviations are measured from the median. For suppose that, for
some origin exceeded by m values out of N, the mean deviation has a value
Δ. Let the origin be displaced by an amount c until it is just exceeded by
m - 1 of the values only, i.e. until it coincides with the mth value from the
upper end of the series. By this displacement of the origin the sum of
deviations in excess of the origin is reduced by mc, while the sum of
deviations in defect of the origin is increased by (N - m)c. The new mean
deviation is therefore

$$\Delta + \frac{(N-m)c - mc}{N} = \Delta + \frac{1}{N}(N - 2m)c$$

The new mean deviation is accordingly less than the old so long as

$$m > \tfrac{1}{2}N$$

That is to say, if N be even, the mean deviation is constant for
origins within the range between the (N/2)th and the (N/2 + 1)th observa-
tions, and this value is the least; if N be odd, the mean deviation is lowest
when the origin coincides with the ((N+1)/2)th observation. The mean
deviation is therefore a minimum when deviations are measured from the
median or, if the latter be indeterminate, from an origin within the range
in which it lies.

Calculation of the mean deviation 

6.20 The mean deviation is perhaps most easily calculated about the
mean, which is always determinate, except in the case of distributions with
an indeterminate final class. As, however, it is a minimum about the
median, we sometimes require to know the value about that point. The
following examples will make the method of calculation clear.

Example 6.6. Let us find the mean deviation about the mean and
about the median in the ungrouped data of Example 6.1.

The data were arranged in alphabetical order of the county wage areas,
which makes it a little difficult to ascertain the median by inspection. On
rearranging in order of magnitude, we find that the median is the value
31s. 6d.

The deviations from the median value are, then, in order of magnitude

-36, -30, -18, -18, -12, -6 (12 times), 0 (10 times),
6 (7 times), 9, 12, 12, 12, 15, 18, 18, 18, 24, 24, 26, 27,
30, 54, 60

The sum of the negative deviations = -186
The sum of the positive deviations = 401
Hence the sum of absolute deviations = 587

$$\text{Hence m.d.} = \frac{587}{49} = 12 \text{ pence approximately.}$$

To find the m.d. about the mean, 31s. 10.4d., we note that the 27
negative or zero deviations from the median would be increased by 4.4
pence on transferring to the mean, and the 22 positive deviations decreased
by 4.4 pence. The net effect on the total absolute deviations is then an
increase of (27 - 22) × 4.4 pence = 22 pence.

Hence the m.d. about the mean is:

$$\frac{587}{49} + \frac{22}{49} = 12.43 \text{ pence}$$
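The two mean deviations of Example 6.6 can be confirmed directly from the wage rates. In this sketch the rates of Example 6.1 are expressed in pence and held as a frequency dictionary; the names `WAGES_PENCE` and `mean_deviation` are illustrative.

```python
from statistics import fmean, median

# Wage rates of Example 6.1 in pence, as {rate: number of areas}
# (for instance 31s. 6d. = 378d.).
WAGES_PENCE = {342: 1, 348: 1, 360: 2, 366: 1, 372: 12, 378: 10, 384: 7,
               387: 1, 390: 3, 393: 1, 396: 3, 402: 2, 404: 1, 405: 1,
               408: 1, 432: 1, 438: 1}
values = sorted(v for v, f in WAGES_PENCE.items() for _ in range(f))


def mean_deviation(xs, origin):
    """Arithmetic mean of the absolute deviations of xs from origin."""
    return sum(abs(x - origin) for x in xs) / len(xs)


med = median(values)
md_about_median = mean_deviation(values, med)
md_about_mean = mean_deviation(values, fmean(values))
print(med, round(md_about_median, 2), round(md_about_mean, 2))
```

The value about the median is the smaller of the two, in agreement with 6.19.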

Example 6.7. — Let us find the mean deviation of heights about the 
mean in the data of Example 6.2. 

In the case of a grouped frequency-distribution the sum of deviations
should first be calculated from the centre of the class-interval in which the
mean (or median) lies and then reduced to the mean (or median) as
origin.

In this case the mean lies in the interval 67-. We found when calculating
it that the negative deviations totalled -8,584 and the positive deviations
8,763. Hence the sum of absolute deviations from the centre of the
interval is 17,347, the unit of measurement being the class-interval.

To reduce to the mean as origin we note that if the number of observations
below the mean is $N_1$ and above the mean $N_2$, and $M - A = d$ as
before, we have to add $N_1 d$ to the sum when found and subtract $N_2 d$. In
this case d = 0.02 class-interval, $N_1$ = 4,918 and $N_2$ = 3,667.
Hence we must add

(4,918 - 3,667) × 0.02 = +25 intervals

i.e. the total of deviations = 17,372

and

$$\text{m.d.} = \frac{17{,}372}{8{,}585} = 2.02 \text{ intervals or inches.}$$
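The reduction to the mean as origin can be checked against a direct summation. The sketch below uses the frequencies of Example 6.2 (entries illegible in the table are inferred from its totals) and works in class-interval units; it keeps d to full precision rather than rounding it to 0.02, so the adjustment comes out as about 26 instead of 25, which does not affect the second decimal place of the result. All names are illustrative.

```python
# Frequencies for the intervals 57-, 58-, ..., 77- inches (Example 6.2).
FREQ = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
        1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
XI = list(range(-10, 11))  # deviations from the centre of the interval 67-

n = sum(FREQ)
d = sum(f * x for f, x in zip(FREQ, XI)) / n  # M - A

# Direct method: absolute deviations measured from the mean itself.
md_direct = sum(f * abs(x - d) for f, x in zip(FREQ, XI)) / n

# Shortcut of the text: sum about the centre of the interval, then add
# N1*d for observations not above the origin and subtract N2*d for the rest.
sum_about_centre = sum(f * abs(x) for f, x in zip(FREQ, XI))
n1 = sum(f for f, x in zip(FREQ, XI) if x <= 0)
n2 = n - n1
md_shortcut = (sum_about_centre + (n1 - n2) * d) / n

print(sum_about_centre, n1, n2, round(md_direct, 2), round(md_shortcut, 2))
```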

The mean deviation from the median should be found in a similar way,
the calculation being assisted if the class-interval in which the median
lies is taken as origin.

6.21 As in the case of the standard deviation, the above calculations
assume for certain purposes that all the values of the variable can be
treated as if they were concentrated at the centres of class-intervals. This
gives sufficient accuracy for all practical purposes if the class-intervals are
reasonably narrow. It has not been found possible to give any simple
correction, such as Sheppard's correction, for errors of grouping in the
mean deviation, but we give at the end of this chapter an Exercise (6.11) as
to the correction to be applied if the values in each interval are treated
as if they were evenly distributed over the interval instead of being
concentrated at its centre.


Empirical relation between mean and standard deviations for symmetrical 
or moderately skew distributions 

6.22 It is a useful rule for the student to remember that for symmetrical 
or moderately skew distributions the mean deviation is about four-fifths 
of the standard deviation. Thus, for the distribution of male statures 
of Examples 6.2 and 6.7, we have — 


$$\frac{\text{m.d.}}{\text{s.d.}} = \frac{2.02}{2.57} = 0.79$$

For the short series of observations of Example 6.1:

$$\frac{\text{m.d.}}{\text{s.d.}} = \frac{12.43}{17.15} = 0.72$$


Quartiles 

6.23 A natural extension of the idea of the median consists in ascer-
taining the variate values $Q_1$ and $Q_3$, such that one-quarter of the observa-
tions lies below $Q_1$ and one-quarter above $Q_3$. In this case clearly one-
quarter lies between $Q_1$ and Mi, the median, and one-quarter between Mi
and $Q_3$.

$Q_1$ is termed the lower quartile and $Q_3$ the upper quartile. The quartiles
and the median thus divide the observed values of the variable into
four classes of equal frequency.

We saw that if the number of observations was even, there was an
indeterminacy in the position of the median which required the additional
convention that in such cases the median would be taken to be mid-way
between the two central values. Similar indeterminacies may arise in
fixing the quartiles unless the number of observations is one less than a
multiple of four. Such cases are treated in an analogous way by supple-
mentary conventions, which will be clear from the following examples.

Example 6.8. — To determine the quartiles of the data of Example 6.1. 

Here there are 49 observations, and so the 25th gives the median.
We regard half the 25th observation as falling below the median and half
above. The lower quartile must divide into two equal parts the 24½
observations falling below the median. The observations other than the
median are:

28/6, 29/-, 30/-, 30/-, 30/6, 31/- (12 times), 31/6 (7 times).

The lower quartile must divide the 24½ observations into two sets of
12¼. The 12th and the 13th values are both, as it happens, 31/-, and $Q_1$
being between the two is thus 31/- also.

The 24 observations between the median and the highest value are:

31/6 (twice), 32/- (7 times), 32/3, 32/6 (3 times), 32/9, 33/- (3 times),
33/6, 33/6, 33/8, 33/9, 34/-, 36/-, 36/6.

The 12th and 13th observations are both 32/6, and hence this is the
value of $Q_3$.

If the 12th and 13th observations had been, say, 32/6 and 33/-, we
might have taken $Q_3$ to be 32/6 but regarded ¼ of the 12th observation
as lying above that value.


Example 6.9. To determine the quartiles of the distribution of Example
6.2.

Data of this kind are treated by simple arithmetical interpolation or
graphical interpolation on the lines of 5.20 or 5.21.

The quartiles are to divide the distribution into four equal parts. We
have, therefore

$$\frac{N}{4} = \frac{8{,}585}{4} = 2146.25$$

To the interval 65- are 1,376 individuals

Difference = 770.25

Hence, $Q_1$ is 770.25/990 from the beginning of the interval, which is
64 15/16.

$$Q_1 = 64\tfrac{15}{16} + \frac{770.25}{990} = 65.71 \text{ inches}$$

Similarly, from the interval 70- onwards are 1,374 individuals.
Difference from 2146.25 = 772.25.

Hence,

$$Q_3 = 69\tfrac{15}{16} - \frac{772.25}{1063} = 69.21 \text{ inches}$$

It is left to the student to check the values by graphical interpolation. 
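The arithmetical interpolation of Example 6.9 generalises to any quantile of a grouped distribution. The sketch below is one possible formulation, not the text's own procedure: it assumes that the interval labelled 57- begins at 56 15/16 inches (as the arithmetic of the example implies for the interval 65-), that intervals are one inch wide, and that frequency is spread evenly within each interval. The function name is hypothetical.

```python
# Frequencies for the intervals 57-, 58-, ..., 77- inches (Example 6.2).
FREQ = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
        1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
FIRST_LOWER_LIMIT = 56 + 15 / 16  # assumed lower limit of the interval 57-
WIDTH = 1.0                       # class-interval in inches


def grouped_quantile(freq, first_lower_limit, width, p):
    """Variate value below which a proportion p of the frequency lies."""
    target = p * sum(freq)
    cumulated = 0
    for i, f in enumerate(freq):
        if f > 0 and cumulated + f >= target:
            lower = first_lower_limit + i * width
            # Linear interpolation within the interval containing the target.
            return lower + width * (target - cumulated) / f
        cumulated += f
    raise ValueError("p must lie between 0 and 1")


q1 = grouped_quantile(FREQ, FIRST_LOWER_LIMIT, WIDTH, 0.25)
q3 = grouped_quantile(FREQ, FIRST_LOWER_LIMIT, WIDTH, 0.75)
print(round(q1, 2), round(q3, 2), round((q3 - q1) / 2, 2))
```

The upper quartile is here reached by counting up from the bottom rather than down from the top, as the example does; the two routes give the same point.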

Quartile deviation

6.24 If Mi be the value of the median, in a symmetrical distribution

$$Mi - Q_1 = Q_3 - Mi$$

and the difference may be taken as a measure of dispersion. But as no
distribution is rigidly symmetrical, it is usual to take as the measure

$$Q = \frac{Q_3 - Q_1}{2}$$

and Q is termed the quartile deviation, or better, the semi-interquartile
range: it is not a measure of the deviation from any particular average.
Thus, from the values calculated in Example 6.8 we have:

$$Q = \frac{32/6 - 31/\text{-}}{2} = \frac{18\text{d.}}{2} = 9 \text{ pence}$$

and from Example 6.9 we have:

$$Q = \frac{69.21 - 65.71}{2} = 1.75 \text{ inches}$$

Empirical relation between quartile and standard deviations

6.25 For symmetrical and moderately skew distributions the semi-
interquartile range is usually about two-thirds of the standard deviation.
Thus, for the height distribution of Examples 6.2 and 6.9,

$$\frac{Q}{\sigma} = \frac{1.75}{2.57} = 0.68$$

For the wage statistics of Examples 6.1 and 6.8,

$$\frac{Q}{\sigma} = \frac{9}{17.15} = 0.52$$

which is considerably lower. We should, however, hardly have expected
the comparatively few observations comprised in these data to conform at
all closely to the empirical relation.

6.26 It follows from this relation that a range of 6 times the standard
deviation corresponds to a range of 9 times the semi-interquartile range
(and 7.5 times the mean deviation). Within these ranges we expect to
find at least 99 per cent of the observations in symmetrical or moderately
skew distributions.

Comparison of the three measures of dispersion

6.27 The semi-interquartile range has two advantages over the standard 
deviation and the mean deviation ; it is calculated with great ease, and 
it has a clear and simple meaning. 

In almost all other respects the advantage lies with the standard 
deviation. The semi-interquartile range has no simple algebraical pro- 
perties, and its behaviour under fluctuations of sampling is difficult to 
decide. In all but the most elementary statistical work these are over- 
whelming disadvantages, and the use of the semi-interquartile range is not 
to be recommended unless the calculation of the standard deviation has 
been rendered difficult or impossible, e.g. owing to the employment of 
irregular class-frequencies or of an indefinite terminal class. 

Absolute measures of dispersion 

6.28 The three measures of dispersion we have been discussing have 
all been expressed in terms of the units of the variate ; e.g. the standard 
deviation of height-frequencies was found in inches, and the mean deviation 
of wage-frequencies in pence. It is thus impossible to compare dispersions 
in different populations unless they happen to be measured in the same 
units. 

For this reason some statisticians have recommended the use of
"absolute" measures of dispersion, which shall be pure numbers and
not expressible in some particular scale of units. Such measures would
permit of comparison between populations of very different natures.

It is easy to construct several coefficients of the kind required. The
standard deviation and the mean deviation have the dimensions of the
variate, and it is only necessary to divide them by another factor which
has the same dimensions; e.g.

$$\frac{\text{Mean deviation}}{\text{Mean}}, \qquad \frac{\text{Mean deviation}}{\text{Mode}}, \qquad \frac{\text{Standard deviation}}{\text{Mean}}$$

are all of the required type.

Coefficient of variation

6.29 The last-mentioned in the foregoing paragraph in a modified
form is the only coefficient which has come into general use. We define
the coefficient of variation, v, as

$$v = 100\,\frac{\sigma}{M} \qquad (6.13)$$

This coefficient is obviously rather unreliable if the mean is near to
zero; but provided the nature of the ratio is kept in mind the coefficient
may be useful in comparing the variation of materials which emanate
from populations of the same type.
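A short sketch of the coefficient, taking it to be the standard deviation expressed as a percentage of the mean, v = 100σ/M (an assumption about the "modified form" referred to), and applying it to the wage data of Example 6.1 with the mean 31s. 10.4d. = 382.4d. The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean; undefined for mean 0."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * sd / mean


# Example 6.1: s.d. 17.15 pence, mean 382.4 pence.
v_wages = coefficient_of_variation(17.15, 382.4)
print(round(v_wages, 2))
```

Being a pure number, the result is the same whether the wages are expressed in pence or in shillings.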

Reduction of frequency-distribution to absolute scale

6.30 Comparability of form may, however, be reached in a different
way; that is to say, by regarding σ itself as a unit and expressing other
measures in terms of it. Thus, in the height distribution of Example
6.2, σ = 2.57 inches, or 1 inch = 0.389σ. Hence the intervals are 0.389σ
in width, and run: 57 × 0.389σ-, 58 × 0.389σ-, etc.; i.e. 22.173σ-,
22.562σ-, etc.

A distribution expressed in this way has unit standard deviation, for

$$\frac{1}{N}\Sigma\left(\frac{x}{\sigma}\right)^2 = \frac{1}{\sigma^2 N}\Sigma(x^2) = 1$$


The distribution reduced to the scale of a may thus be regarded as 
expressed in " absolute ” units, and two distributions expressed in this way 
may readily be compared as regards form, but not as regards dispersion, 
for this has been made the same in the two cases. 
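The effect of expressing a variate in units of its own standard deviation can be seen in a few lines. The observations below are invented for illustration only.

```python
from statistics import pstdev

# Hypothetical heights in inches (not taken from the text).
heights = [61.0, 64.5, 66.0, 67.5, 68.0, 70.5, 73.0]

sigma = pstdev(heights)                 # standard deviation with divisor N
scaled = [h / sigma for h in heights]   # the same heights in units of sigma

print(round(sigma, 3), round(pstdev(scaled), 3))
```

Whatever the original dispersion, the rescaled values have a standard deviation of one, so only differences of form remain to be compared.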


Deciles and percentiles 

6.31 We may conclude this chapter by describing briefly methods 
which have been much used in the past in lieu of the methods described 
in this and the preceding chapter. 

Instead of dividing the total frequency into 4 parts by quartiles, we
may divide it into 100 parts by what are called percentiles. Or we may
divide into 10 parts by deciles. The theory of these quantities is precisely 
analogous to that of the quartiles : there may, for instance, be certain 
indeterminacies in their exact definition which are removed by supple- 
mentary conventions ; they can be obtained by arithmetical or graphical 
interpolation ; and they have simple and obvious meanings. 

Quantities such as quartiles, deciles, etc., which divide the total fre- 
quency into a number of parts, are called quantiles or grades, and when we 
speak of the grade of an individual we mean thereby the proportion of the 
total frequency which lies below it. Conventionally, half the individual 
is regarded as lying above, and half below, the point determined by the 
variate value which it bears.

The distribution curve

6.32 The grades or quantiles may conveniently be found by a graphical
method which is an extension of that of 5.21. Against the variate-value
as abscissa we graph as ordinate the cumulated frequency up to and in-
cluding the corresponding variate-value. This is called the distribution
curve. By reading off the ordinate corresponding to a given variate we
can find, approximately at least, the number of members of the population
bearing that or a lower value. Similarly, by reading off the variate
corresponding to a given ordinate we can find the quartiles, just as we
found the median in 5.21. In figure 6.2 we show the distribution curve
for the data of Example 6.2, with the lines corresponding to the median
and the quartiles. Figure 5.3 is really an enlarged version of part of this
curve.

A somewhat similar form of graph (with the percentiles as abscissa and 
the variate as ordinate) was formerly in use and was known as Galton’s 
ogive. The curve was not, however, always shaped like an ogive. The 
distribution curve appears to provide a more natural method of representa- 
tion and a better name. The mathematical reader will recognise it as 
the graph of the integral of the frequency curve. 
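The reading of a quantile from the distribution curve can be imitated arithmetically. The following is a minimal sketch, assuming grouped data and straight-line interpolation within a class; the function name `quantile_from_grouped` and the figures are hypothetical and are not taken from the text.

```python
from bisect import bisect_left
from itertools import accumulate


def quantile_from_grouped(limits, freqs, p):
    """Variate value below which a proportion p (0 < p < 1) of the total lies.

    limits: class boundaries, one more than the number of classes.
    freqs:  class frequencies.
    """
    # Ordinates of the distribution curve at each class boundary.
    cumulated = [0] + list(accumulate(freqs))
    target = p * cumulated[-1]
    # First boundary at which the cumulated frequency reaches the target.
    j = bisect_left(cumulated, target, 1)
    i = j - 1  # index of the class containing the quantile
    width = limits[j] - limits[i]
    # Straight-line interpolation within the class.
    return limits[i] + width * (target - cumulated[i]) / freqs[i]


# Hypothetical grouped data, not taken from the text.
limits = [0, 10, 20, 30, 40]
freqs = [5, 15, 20, 10]
lower_quartile = quantile_from_grouped(limits, freqs, 0.25)
median = quantile_from_grouped(limits, freqs, 0.50)
upper_quartile = quantile_from_grouped(limits, freqs, 0.75)
print(lower_quartile, median, upper_quartile)
```

Deciles and percentiles follow by passing p = 0.1, 0.2, ... or p = 0.01, 0.02, ... to the same function.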

6.33 An extension of the method of quantiles to the treatment of
non-measurable characters has also become of some importance. For example,
the capacity of the different boys in a class as regards some school subject
cannot be directly measured, but it may not be very difficult for the
master to arrange them in order of merit as regards this character; if the
boys are then "numbered up" in order, the number of each boy, or his
rank, serves as some sort of index to his capacity. It should be noted
that rank in this sense is not quite the same as grade; if a boy is tenth,
say, from the bottom in a class of a hundred his grade is 9.5, but the
method is in principle the same as that of grades or quantiles. The
method of ranks, grades or quantiles in such a case may be a very serviceable
auxiliary, though, of course, it is better if possible to obtain a numerical
measure. But if, in the case of a measurable character, the quantiles
are used not merely as constants illustrative of certain aspects of the
frequency-distribution, but entirely to replace the table giving the
frequency-distribution, serious inconvenience may be caused, as the
application of other methods to the data is barred. Given the table
showing the frequency-distribution, the reader can calculate not only
the quantiles, but any form of average or measure of dispersion that has
yet been proposed, to a sufficiently high degree of approximation. But
given only certain quantiles such as the percentiles, or at least so few of
them as the nine deciles, he cannot pass back to the frequency-distribution,
and thence to other constants, with any degree of accuracy. In all cases
of published work, therefore, the figures of the frequency-distribution
should be given; they are absolutely fundamental.

Fig. 6.2. Distribution curve for stature
(Same data as fig. 4.6)

Gini's mean difference

6.34 The Italian statistician Corrado Gini has proposed a measure of
dispersion which at first sight seems to have certain advantages over the
standard deviation. It is the mean of the differences (taken regardless
of sign) of each possible pair of variate values exhibited by the population;
e.g., if the frequency of the value $x_j$ is $f_j$, the coefficient of mean difference is

$$\Delta_1 = \frac{1}{N(N-1)}\sum_{j}\sum_{k}\left\{|x_j - x_k|\,f_j f_k\right\} \qquad (6.14)$$

or, if we regard each member as taken with itself, contributing nothing
to the sum in (6.14) but increasing the number of pairs of values to $N^{2}$
instead of $N(N-1)$, we have the coefficient of mean difference with
repetition:

$$\Delta_2 = \frac{1}{N^{2}}\sum_{j}\sum_{k}\left\{|x_j - x_k|\,f_j f_k\right\} \qquad (6.15)$$


These coefficients are more difficult to calculate than the standard
deviation or the mean deviation, but they have a theoretical attraction
in that they depend on the differences of values between themselves and
not on the spread about some arbitrary point such as the mean or the
median. They thus measure, in a sense, the intrinsic spread of the
population independently of an origin of location.

A similar property, however, is possessed by the standard deviation.
Suppose that, in equation (6.15), we sought to obviate the difficulties
of using absolute values by defining a new coefficient $E$ by the similar
expression

$$E^{2} = \frac{1}{N^{2}}\sum_{j}\sum_{k}\left\{(x_j - x_k)^{2} f_j f_k\right\} \qquad (6.16)$$

Since

$$(x_j - x_k)^{2} = x_j^{2} + x_k^{2} - 2x_j x_k$$

and

$$\sum_{j}\sum_{k}(x_j^{2} f_j f_k) = \sum_{j}(x_j^{2} f_j)\sum_{k}(f_k) = N^{2}s^{2}$$

where $s$ is the root-mean-square deviation about the origin from which the
$x$'s are measured, while $\sum_{j}\sum_{k}(x_j x_k f_j f_k) = N^{2}d^{2}$, $d$ being the
mean measured from the same origin, we find

$$E^{2} = \frac{1}{N^{2}}\left\{2N^{2}s^{2} - 2N^{2}d^{2}\right\} = 2(s^{2} - d^{2}) = 2\sigma^{2} \qquad (6.17)$$

so that $E$ is merely the standard deviation multiplied by $\sqrt{2}$. This relation
shows that, apart from the constant $\sqrt{2}$, the standard deviation may be
regarded as the root-mean-square of all possible pairs of differences of
the variate values. Such being the case, the mean difference of Gini
loses most of its relative theoretical attraction, and as it is more difficult
to calculate the balance of advantage remains with the standard deviation.
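The relation between $E$ and the standard deviation is easily verified numerically. The sketch below is illustrative only: the function names and the small frequency table are hypothetical, and the double sums are formed directly from the definitions (6.14) to (6.16) rather than by any labour-saving device.

```python
from math import sqrt


def pair_sum(values, freqs, power):
    """Sum of |x_j - x_k|**power * f_j * f_k over every ordered pair (j, k)."""
    return sum(
        abs(xj - xk) ** power * fj * fk
        for xj, fj in zip(values, freqs)
        for xk, fk in zip(values, freqs)
    )


def mean_difference(values, freqs, with_repetition=False):
    """Coefficient of mean difference, (6.14), or (6.15) with repetition."""
    n = sum(freqs)
    pairs = n * n if with_repetition else n * (n - 1)
    return pair_sum(values, freqs, 1) / pairs


def root_mean_square_difference(values, freqs):
    """The coefficient E of equation (6.16)."""
    n = sum(freqs)
    return sqrt(pair_sum(values, freqs, 2) / (n * n))


def standard_deviation(values, freqs):
    """Standard deviation of a frequency table, divisor N."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sqrt(sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n)


# Hypothetical data, not taken from the text.
values = [1, 2, 4, 7]
freqs = [2, 3, 1, 4]
print(mean_difference(values, freqs))
print(root_mean_square_difference(values, freqs),
      sqrt(2) * standard_deviation(values, freqs))
```

The last two printed figures should agree to within rounding, as (6.17) requires.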


SUMMARY 

1. The standard deviation $\sigma$ is defined by

$$\sigma^{2} = \frac{1}{N}\sum(fx^{2})$$

where $x$ is the deviation from the arithmetic mean. $\sigma^{2}$ is called the
"variance."

2. The root-mean-square deviation $s$ about a point $A$ is defined by

$$s^{2} = \frac{1}{N}\sum(f\xi^{2})$$

where $\xi$ is the deviation from $A$.

3. If $M - A = d$, then

$$s^{2} = \sigma^{2} + d^{2}$$

4. For grouped data the variance should be corrected by subtracting
$h^{2}/12$, where $h$ is the width of the class-interval, provided that (a) the
frequency is continuous, and (b) that it tapers off to zero in both directions.

5. The s.d. is the minimum root-mean-square deviation.

6. The mean deviation is defined as

$$\text{m.d.} = \frac{1}{N}\sum(f\,|\xi|)$$

7. The m.d. is a minimum about the median.

8. The quartiles are the values of the variate which divide the total
frequency into 4 equal parts; similarly, the deciles divide it into 10 equal
parts and the percentiles into 100 equal parts.

9. The quartile deviation, or semi-interquartile range, is defined as

$$Q = \frac{Q_3 - Q_1}{2}$$

10. For symmetrical or moderately skew distributions,

m.d. $= 0.8\sigma$ and $Q = 0.67\sigma$ approximately.

11. For the majority of such distributions 99 per cent of the total
frequency lies within a range of $6\sigma$, 7.5 m.d. or $9Q$.


EXERCISES 

6.1 Verify the following for the data of Table 4.7, page 82 (in continuation
of the work of Exercise 5.1):

                                           Stature in inches for adult males born in
                                           England   Scotland   Wales    Ireland
Standard deviation (uncorrected)             2.56      2.50      2.35      2.17
Mean deviation                               2.05      1.95      1.82      1.69
Quartile deviation                           1.78      1.56      1.46      1.35
Mean deviation / standard deviation          0.80      0.78      0.78      0.78
Quartile deviation / standard deviation      0.69      0.62      0.62      0.62
Lower quartile                              65.55     66.92     65.08     66.39
Upper quartile                              69.10     70.04     67.98     69.10

6.2 Find the standard deviation, mean deviation, quartiles and
semi-interquartile range for the data in the last column of the table of Exercise
4.6, page 100 (in continuation of the work of Exercise 5.3).

Compare the ratios of mean and quartile deviations to the standard 
deviation with those stated in 6.22 and 6.25 to be usual for moderately 
skew distributions. 

6.3 Using, or extending if necessary, your diagram for Exercise 5.5, 
page 123, find the median and upper quartile for incomes subject to sur- 
or super-tax.

Find also the 9th decile (the value exceeded by 10 per cent of incomes 
only). 

6.4 Find the quartiles of the distribution of Australian marriages given 
in Example 6.3, and find the semi-interquartile range. 

6.5 Find directly the standard deviation of the natural numbers from 
1 to 10, and hence verify equation (6.10). 

6.6 Show that, for any distribution, the standard deviation is not less 
than the mean deviation about the mean. 

6.7 Show that, for a J-shaped distribution with the maximum frequency
towards the lower values of the variate, the median is nearer to $Q_1$ than
to $Q_3$.

6.8. Find the mean and standard deviation of the following numbers 
(1) without further grouping, (2) grouping the numbers by fives (40-, 45-, 
50-, etc.), (3) grouping by tens (40-, 50-, etc.) — 

40, 43, 43, 46, 46, 46, 54, 56, 59, 62, 64, 64, 66, 66, 67, 67, 68, 68,
69, 69, 69, 71, 75, 75, 76, 76, 78, 80, 82, 82, 82, 82, 82, 83, 84,
86, 88, 90, 90, 91, 91, 92, 95, 102, 127.

6.9 Apply Sheppard's correction to the standard deviations calculated 
in Exercises 6.1 and 6.2 above. 

6.10 (Continuing Exercise 5.9, p. 123.) Supposing the frequencies of
values 0, 1, 2, 3, . . . of a variable to be given by the terms of the binomial
series

$$q^{n},\quad nq^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^{2},\ \ldots$$

where $p + q = 1$, find the standard deviation.

6.11 (Cf. the remarks at the end of 6.21.) The sum of the deviations
(without regard to sign) about the centre of the class-interval containing
the mean (or median), in a grouped frequency-distribution, is found to be
$S$. Find the correction to be applied to this sum, in order to reduce it
to the mean (or median) as origin, on the assumption that the observations
are evenly distributed over each class-interval. Take the number of
observations below the interval containing the mean (or median) to be
$n_1$, in that interval $n_2$, and above it $n_3$; and the distance of the mean (or
median) from the arbitrary origin to be $d$.

6.12 Show that if deviations are small compared with the mean, so that
$(x/M)^{3}$ and higher powers of $x/M$ may be neglected, we have approximately
the relation

$$G = M\left(1 - \frac{1}{2}\,\frac{\sigma^{2}}{M^{2}}\right)$$

where $G$ is the geometric mean, $M$ the arithmetic mean and $\sigma$ the standard
deviation: and consequently to the same degree of approximation
$M^{2} - G^{2} = \sigma^{2}$.

6.13 Similarly, show that if deviations are small compared with the mean,
we have approximately

$$H = M\left(1 - \frac{\sigma^{2}}{M^{2}}\right)$$

$H$ being the harmonic mean.

6.14 Find the coefficients of variation of the height distributions of
Exercise 6.1 (using the uncorrected values of the s.d. as given).

6.15 Show that if a range of six times the standard deviation covers at
least 18 class-intervals, Sheppard's correction will make a difference of
less than 0.5 per cent in the uncorrected value of the standard deviation.



CHAPTER SEVEN


MOMENTS AND MEASURES OF SKEWNESS 
AND KURTOSIS 


Moments 

7.1 In considering the calculation of the mean and the root-mean-square
deviation we have defined, in passing, the quantities $\frac{1}{N}\sum(f\xi)$ and
$\frac{1}{N}\sum(f\xi^{2})$ as the first and second moments about the value $A$, $\xi$ being as
before the value $X - A$, i.e. the excess of the variate value $X$ over the value
$A$. The first moment about the mean is zero, and the second moment
about the mean is the variance (6.6).

In generalisation of these definitions we now define the $n$th moment
about $A$ as $\mu_n'$, where

$$\mu_n' = \frac{1}{N}\sum(f\xi^{n}) \qquad (7.1)$$

The moments about the mean, which are of particular importance,
we write without dashes so that

$$\mu_n = \frac{1}{N}\sum(fx^{n}) \qquad (7.2)$$

From these definitions we have:

$$\mu_0' = \mu_0 = \frac{1}{N}\sum(f) = 1 \quad \text{since } \xi^{0} \text{ and } x^{0} = 1$$

$$\mu_1' = d, \qquad \mu_1 = 0, \qquad \mu_2' = s^{2}, \qquad \mu_2 = \sigma^{2}$$

These results we have already seen.

7.2 The word "moment" derives from Statics, and we may direct
the attention of the student who is familiar with moments of forces to the
fact that the sum $\sum(f\xi^{n})$ is divided by $N$ in the definition above. This
amounts to a slight departure from the Statical practice, and some writers
refer to what we have called "moments" as "moment-coefficients" in
order to keep this fact in mind. In Statistics, however, no confusion is
likely to arise from the use of the briefer form "moments."


Moments about the mean in terms of moments about any point 

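7.3 We have, by definition,

$$\xi = X - A = (X - M) + (M - A) = x + d$$

Hence,

$$\xi^{n} = (x + d)^{n}$$

and

$$\sum(f\xi^{n}) = \sum\{f(x + d)^{n}\}$$

Now, by the binomial theorem,

$$(x + d)^{n} = x^{n} + {}^{n}C_{1}\,x^{n-1}d + {}^{n}C_{2}\,x^{n-2}d^{2} + \ldots + d^{n}$$

Hence,

$$\sum(f\xi^{n}) = \sum(fx^{n}) + {}^{n}C_{1}\,d\sum(fx^{n-1}) + {}^{n}C_{2}\,d^{2}\sum(fx^{n-2}) + \ldots + d^{n}\sum(f)$$

Dividing by $N$ we get:

$$\mu_n' = \mu_n + {}^{n}C_{1}\,d\,\mu_{n-1} + {}^{n}C_{2}\,d^{2}\mu_{n-2} + \ldots + d^{n} \qquad (7.3)$$

Similarly,

$$\sum(fx^{n}) = \sum\{f(\xi - d)^{n}\}$$

and

$$\mu_n = \mu_n' - {}^{n}C_{1}\,d\,\mu_{n-1}' + {}^{n}C_{2}\,d^{2}\mu_{n-2}' - \ldots + (-1)^{n}d^{n} \qquad (7.4)$$

These useful relations express the moments about the mean in terms
of those about an arbitrary point $A$, and vice versa.

In particular we have:

If $n = 1$,

$$\mu_1' = \mu_1 + d = d \qquad \text{from (7.3)}$$

$$\mu_1 = \mu_1' - d = 0 \qquad \text{from (7.4)}$$

which are simply the relation $M - A = d$ in another form.

If $n = 2$,

$$\mu_2' = \mu_2 + 2d\mu_1 + d^{2} = \mu_2 + d^{2} = \sigma^{2} + d^{2} \qquad \text{from (7.3)}$$

$$\mu_2 = \mu_2' - 2d\mu_1' + d^{2} = \mu_2' - 2d^{2} + d^{2} = \mu_2' - d^{2} \qquad \text{from (7.4)}$$

These are the relation $\mu_2' = \sigma^{2} + d^{2}$.

If $n = 3$,

$$\mu_3' = \mu_3 + 3d\mu_2 + 3d^{2}\mu_1 + d^{3} = \mu_3 + 3d\mu_2 + d^{3} \qquad \text{from (7.3)} \quad (7.5)$$

$$\mu_3 = \mu_3' - 3d\mu_2' + 3d^{2}\mu_1' - d^{3} = \mu_3' - 3d\mu_2' + 2d^{3} \qquad \text{from (7.4)} \quad (7.6)$$

If $n = 4$,

$$\mu_4' = \mu_4 + 4d\mu_3 + 6d^{2}\mu_2 + 4d^{3}\mu_1 + d^{4} = \mu_4 + 4d\mu_3 + 6d^{2}\mu_2 + d^{4} \qquad \text{from (7.3)} \quad (7.7)$$

$$\mu_4 = \mu_4' - 4d\mu_3' + 6d^{2}\mu_2' - 4d^{3}\mu_1' + d^{4} = \mu_4' - 4d\mu_3' + 6d^{2}\mu_2' - 3d^{4} \qquad \text{from (7.4)} \quad (7.8)$$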
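Relations (7.3) and (7.4) lend themselves to direct computation for any order. The following is a sketch under stated assumptions: the function names are hypothetical, the moments are held in a list whose entry 0 is unity, and the illustrative figures are the moments about the working mean quoted for the marriage data in Example 7.2 of this chapter.

```python
from math import comb


def moments_about_mean(raw):
    """Equation (7.4). raw[n] is the nth moment about A, with raw[0] == 1."""
    d = raw[1]  # first moment about A, i.e. d = M - A
    return [
        sum(comb(n, r) * (-d) ** r * raw[n - r] for r in range(n + 1))
        for n in range(len(raw))
    ]


def moments_about_point(central, d):
    """Equation (7.3). central[n] is the nth moment about the mean."""
    return [
        sum(comb(n, r) * d ** r * central[n - r] for r in range(n + 1))
        for n in range(len(central))
    ]


# Moments about the working mean quoted in Example 7.2 (class-interval units).
raw = [1.0, 0.294355253, 7.143622115, 42.408873887, 454.980075219]
central = moments_about_mean(raw)
print([round(m, 6) for m in central])
```

Passing the result back through `moments_about_point` with the same $d$ should recover the original list, which gives a convenient check on the arithmetic.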


Calculation of moments 

7.4 The calculation of moments of the third and higher orders is similar 
to that of the first and second. For grouped data we regard the observa- 
tions as concentrated at the mid-points of the intervals ; we choose a 
convenient arbitrary origin A, find the moments about it and use the 
relations (7.3) and (7.4) above to find the moments about the mean ; we 
use a check on the arithmetic similar to that of 6.11; and we have, under
certain conditions, certain Sheppard corrections for grouping.

In practice we rarely require to ascertain moments higher than the 
fourth. Indeed, moments of higher orders, though important in theory, 
are so extremely sensitive to sampling fluctuations that values calculated 
for moderate numbers of observations are quite unreliable and hardly ever 
repay the labour of computation. 

7.5 There are various checks in use for the arithmetic of calculation.
We shall use a generalisation of the simple identities of 5.12 and 6.11.
In fact, we have

$$(\xi + 1)^{3} = \xi^{3} + 3\xi^{2} + 3\xi + 1$$

and hence,

$$\sum\{f(\xi + 1)^{3}\} = \sum(f\xi^{3}) + 3\sum(f\xi^{2}) + 3\sum(f\xi) + N$$

Similarly,

$$\sum\{f(\xi + 1)^{4}\} = \sum(f\xi^{4}) + 4\sum(f\xi^{3}) + 6\sum(f\xi^{2}) + 4\sum(f\xi) + N$$

and so on.

Thus, in calculating $\sum(f\xi^{3})$ we also find $\sum\{f(\xi + 1)^{3}\}$, and this,
together with the sums of lower orders, will give us a ready check on the
work.
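The check may be sketched as follows. The names and the small frequency table are hypothetical; the sums are kept as whole numbers so that the two sides of each identity can be compared exactly.

```python
from math import comb


def power_sums(deviations, freqs, max_order):
    """[N, sum(f*xi), sum(f*xi**2), ...] up to the given order."""
    return [
        sum(f * xi ** r for xi, f in zip(deviations, freqs))
        for r in range(max_order + 1)
    ]


def expected_shifted_sum(sums, n):
    """sum(f*(xi+1)**n) as predicted by the binomial identity."""
    return sum(comb(n, r) * sums[r] for r in range(n + 1))


def direct_shifted_sum(deviations, freqs, n):
    """sum(f*(xi+1)**n) formed independently, column by column."""
    return sum(f * (xi + 1) ** n for xi, f in zip(deviations, freqs))


# Hypothetical deviations (in class-intervals) and frequencies.
deviations = [-2, -1, 0, 1, 2, 3]
freqs = [3, 8, 12, 9, 5, 1]
sums = power_sums(deviations, freqs, 4)
for n in (1, 2, 3, 4):
    agrees = direct_shifted_sum(deviations, freqs, n) == expected_shifted_sum(sums, n)
    print(n, agrees)
```

A disagreement at any order points to a slip in that column or in one of the lower-order sums on which it depends.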

Example 7.1. Continuing our work on the height distribution of
Table 4.7, page 82, let us find the third and fourth moments of the
distribution about the mean.

In almost all practical work we require the first and second moments
as a matter of course. It is therefore best to proceed systematically in
the computation of the various moments by setting out the arithmetic in
tabular form.

[Table: Calculation of the first four moments of the distribution of heights of Table 4.7, p. 82. Column totals: 8,763; 13,759; 56,809; 65,752; 119,391; 236,653; 1,182,061; 1,539,292.]

From this table we have:

$$\begin{aligned}
\sum(f\xi) &= 8{,}763 - 8{,}584 = 179 \\
\sum(f\xi^{2}) &= 56{,}809 \\
\sum(f\xi^{3}) &= 119{,}391 - 117{,}622 = 1{,}769 \\
\sum(f\xi^{4}) &= 1{,}182{,}061
\end{aligned}$$

As a check on $\sum(f\xi^{3})$ we have:

$$\begin{aligned}
\sum(f\xi^{3}) + 3\sum(f\xi^{2}) + 3\sum(f\xi) + N &= 1{,}769 + 170{,}427 + 537 + 8{,}585 \\
&= 181{,}318 \\
&= \sum\{f(\xi + 1)^{3}\}
\end{aligned}$$

As a check on $\sum(f\xi^{4})$ we have:

$$\begin{aligned}
\sum(f\xi^{4}) + 4\sum(f\xi^{3}) + 6\sum(f\xi^{2}) + 4\sum(f\xi) + N &= 1{,}182{,}061 + 7{,}076 + 340{,}854 + 716 + 8{,}585 \\
&= 1{,}539{,}292 \\
&= \sum\{f(\xi + 1)^{4}\}
\end{aligned}$$

We have then:

$$\begin{aligned}
d = \mu_1' = \frac{1}{N}\sum(f\xi) &= \frac{179}{8{,}585} = 0.02085032 \\
\mu_2' &= \frac{56{,}809}{8{,}585} = 6.61723937 \\
\mu_3' &= \frac{1{,}769}{8{,}585} = 0.20605708 \\
\mu_4' &= \frac{1{,}182{,}061}{8{,}585} = 137.68910891 \\
\mu_2 = \mu_2' - d^{2} &= 6.616805
\end{aligned}$$

From equation (7.6):

$$\begin{aligned}
\mu_3 &= \mu_3' - 3d\mu_2' + 2d^{3} \\
&= 0.20605708 - 0.41391467 + 0.00001813 \\
&= -0.207839
\end{aligned}$$

From equation (7.8):

$$\begin{aligned}
\mu_4 &= \mu_4' - 4d\mu_3' + 6d^{2}\mu_2' - 3d^{4} \\
&= 137.68910891 - 0.01718424 + 0.01726051 - 0.00000057 \\
&= 137.689185
\end{aligned}$$

which gives us $\mu_4$ in units based on class-intervals, i.e. inches.

Example 7.2. To find the moments about the mean of the distribution
of Australian marriages of Table 4.8, page 84.

Until the last stage we work in class-intervals of 3 years. As in Example
6.3, page 132, we take a working mean at 28.5 years.

From this table we have:

$$\begin{aligned}
\sum(f\xi) &= 318{,}049 - 229{,}217 = 88{,}832 \\
\sum(f\xi^{2}) &= 2{,}155{,}838 \\
\sum(f\xi^{3}) &= 13{,}675{,}105 - 876{,}743 = 12{,}798{,}362 \\
\sum(f\xi^{4}) &= 137{,}306{,}162
\end{aligned}$$

As a check on $\sum(f\xi)$ we have:

$$\sum(f\xi) + N = 88{,}832 + 301{,}785 = 390{,}617 = \sum\{f(\xi + 1)\}$$

Similarly, for $\sum(f\xi^{2})$:

$$\sum(f\xi^{2}) + 2\sum(f\xi) + N = 2{,}155{,}838 + 177{,}664 + 301{,}785 = 2{,}635{,}287 = \sum\{f(\xi + 1)^{2}\}$$

As a check on $\sum(f\xi^{3})$:

$$\begin{aligned}
\sum(f\xi^{3}) + 3\sum(f\xi^{2}) + 3\sum(f\xi) + N &= 12{,}798{,}362 + 6{,}467{,}514 + 266{,}496 + 301{,}785 \\
&= 19{,}834{,}157 \\
&= \sum\{f(\xi + 1)^{3}\}
\end{aligned}$$

As a check on $\sum(f\xi^{4})$:

$$\begin{aligned}
\sum(f\xi^{4}) + 4\sum(f\xi^{3}) + 6\sum(f\xi^{2}) + 4\sum(f\xi) + N &= 137{,}306{,}162 + 51{,}193{,}448 + 12{,}935{,}028 + 355{,}328 + 301{,}785 \\
&= 202{,}091{,}751 \\
&= \sum\{f(\xi + 1)^{4}\}
\end{aligned}$$

Hence, about the working mean:

$$\begin{aligned}
d = \mu_1' &= \frac{88{,}832}{301{,}785} = 0.294355253 \\
\mu_2' &= \frac{2{,}155{,}838}{301{,}785} = 7.143622115 \\
\mu_3' &= \frac{12{,}798{,}362}{301{,}785} = 42.408873887 \\
\mu_4' &= \frac{137{,}306{,}162}{301{,}785} = 454.980075219
\end{aligned}$$

For moments about the mean:

$$\begin{aligned}
\mu_2 &= \mu_2' - d^{2} = 7.056977 \\
\mu_3 &= \mu_3' - 3d\mu_2' + 2d^{3} = 36.151595 \\
\mu_4 &= \mu_4' - 4d\mu_3' + 6d^{2}\mu_2' - 3d^{4} = 408.738210
\end{aligned}$$

These are expressed in class-intervals, which are units of three years.
If, as we rarely do, we wish to express the results in other units, say one
year, we must multiply the first moment by 3, the second by $3^{2}$, the third
by $3^{3}$, the fourth by $3^{4}$, and so on; e.g.

$$\mu_2 = 7.056977 \times 9 = 63.51279$$

In this and the preceding example we have retained more digits than
are probably necessary, but the student will find it as well to retain several
more than appear to be required, since subsequent work involving
multiplication or addition may otherwise throw doubt on the final figures.
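The passage from the column totals to the moments about the mean, and the change of unit, may be sketched as below. The function names are hypothetical; the totals are those quoted in Examples 7.1 and 7.2, and relations (7.6) and (7.8) are written out explicitly rather than in the general binomial form.

```python
def moments_from_sums(n_total, sums):
    """Return d, mu2, mu3, mu4 from [sum(f*xi), ..., sum(f*xi**4)].

    All quantities are in class-interval units about the working mean.
    """
    m1, m2, m3, m4 = (s / n_total for s in sums)
    d = m1
    mu2 = m2 - d ** 2
    mu3 = m3 - 3 * d * m2 + 2 * d ** 3                       # equation (7.6)
    mu4 = m4 - 4 * d * m3 + 6 * d ** 2 * m2 - 3 * d ** 4     # equation (7.8)
    return d, mu2, mu3, mu4


def rescale(moments, h):
    """Express d, mu2, mu3, mu4 in units h times smaller (e.g. h = 3 years)."""
    return tuple(m * h ** (order + 1) for order, m in enumerate(moments))


heights = moments_from_sums(8585, [179, 56809, 1769, 1182061])
marriages = moments_from_sums(301785, [88832, 2155838, 12798362, 137306162])
print([round(m, 6) for m in heights])
print([round(m, 6) for m in marriages])
print([round(m, 5) for m in rescale(marriages, 3)])
```

Carrying the full machine precision through to the end, as here, avoids the loss of figures against which the text warns.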

7.6 It will be evident that the labour involved in calculating the third 
and fourth moments is very considerable. Calculating machines or
tables of powers are a great help, and certain tables for the specific purpose
of computing moments will be found in Tables for Statisticians and
Biometricians, Part I. The student should familiarise himself with the
methods given in the two examples above, since, although we shall not 
use them to any great extent in this book, moments are important in 
more advanced theory. 

Sheppard corrections for moments

7.7 As in the case of the second moment, the effect due to grouping
at mid-points of intervals may be corrected for by formulae due to W. F.
Sheppard, from whom they derive their name. The formulae for the
second, third and fourth moments are as follows:

$$\begin{aligned}
\mu_2\ (\text{corrected}) &= \mu_2 - \frac{h^{2}}{12} \\
\mu_3\ (\text{corrected}) &= \mu_3 \\
\mu_4\ (\text{corrected}) &= \mu_4 - \tfrac{1}{2}h^{2}\mu_2 + \tfrac{7}{240}h^{4}
\end{aligned} \qquad (7.9)$$

where $h$ is the width of the class-interval. If we are working in
class-intervals as units, $h$ is taken to be unity.

The use of these formulae is restricted to the cases which we mentioned
in 6.12, i.e. those in which (a) the frequency-distribution is continuous,
and (b) the distribution tapers off to zero in both directions.

Example 7.3. In Example 7.1 we found:

$$\begin{aligned}
\mu_2 &= 6.616805 \\
\mu_3 &= -0.207839 \\
\mu_4 &= 137.689185
\end{aligned}$$

Applying the above corrections, $h$ being 1:

$$\begin{aligned}
\mu_2\ (\text{corr.}) &= 6.616805 - 0.083333 = 6.533472 \\
\mu_3\ (\text{corr.}) &= -0.207839 \\
\mu_4\ (\text{corr.}) &= 137.689185 - 3.308402 + 0.029167 = 134.409950
\end{aligned}$$

Example 7.4. In Example 7.2 we have, in units of 3 years:

$$\begin{aligned}
\mu_2 &= 7.056977 \\
\mu_3 &= 36.151595 \\
\mu_4 &= 408.738210
\end{aligned}$$

Thus:

$$\begin{aligned}
\mu_2\ (\text{corr.}) &= 7.056977 - 0.083333 = 6.973644 \\
\mu_3\ (\text{corr.}) &= 36.151595 \\
\mu_4\ (\text{corr.}) &= 408.738210 - 3.528489 + 0.029167 = 405.238888
\end{aligned}$$

In units of one year the corrected moments are given by multiplying
by 9, 27 and 81 as before.
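The corrections (7.9) amount to very little arithmetic, as the following sketch shows. The function name is hypothetical, and the conditions under which the corrections are legitimate are assumed to have been checked beforehand; the code does not test them.

```python
def sheppard_corrected(mu2, mu3, mu4, h=1.0):
    """Apply the corrections (7.9); h is the width of the class-interval."""
    mu2_c = mu2 - h ** 2 / 12
    mu3_c = mu3  # the third moment needs no correction
    mu4_c = mu4 - 0.5 * h ** 2 * mu2 + 7 * h ** 4 / 240
    return mu2_c, mu3_c, mu4_c


# Uncorrected moments of Examples 7.1 and 7.2, in class-interval units.
print([round(m, 6) for m in sheppard_corrected(6.616805, -0.207839, 137.689185)])
print([round(m, 6) for m in sheppard_corrected(7.056977, 36.151595, 408.738210)])
```

Note that the correction to the fourth moment uses the uncorrected second moment, as in the worked examples.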


$\beta$- and $\gamma$-coefficients

7.8 Certain quantities calculated from the moments about the mean
are of particular importance in statistical work. We define:

$$\beta_1 = \frac{\mu_3^{2}}{\mu_2^{3}} \qquad (7.10)$$

$$\beta_2 = \frac{\mu_4}{\mu_2^{2}} \qquad (7.11)$$

and two further quantities:

$$\gamma_1 = +\sqrt{\beta_1} \qquad (7.12)$$

$$\gamma_2 = \beta_2 - 3 \qquad (7.13)$$

The reason for the introduction of these arbitrary-looking quantities will
appear in the sequel.¹

¹ In general, Karl Pearson defined

It is to be noted that these four coefficients are all pure numbers and,
as such, are independent of the scale of measurement of the variable; for
since $\mu_n$ has the dimensions of (variable)$^{n}$, $\mu_3^{2}$ has the dimensions (variable)$^{6}$
and so has $\mu_2^{3}$, and hence their quotient has dimension zero, i.e. is a pure
number; and similarly for the quotient of $\mu_4$ and $\mu_2^{2}$.

Example 7.5. Let us calculate $\beta_1$ and $\beta_2$ for the distribution of Example
7.1.

We have, using the corrected values of Example 7.3:

$$\beta_1 = \frac{(-0.207839)^{2}}{(6.533472)^{3}} = \frac{0.043197}{278.889} = 0.000155$$

$$\beta_2 = \frac{134.40995}{(6.533472)^{2}} = \frac{134.40995}{42.68626} = 3.149$$

Example 7.6. Similarly, in the data of Example 7.2, using corrected
values:

$$\beta_1 = \frac{(36.151595)^{2}}{(6.973644)^{3}} = 3.854$$

$$\beta_2 = \frac{405.238888}{(6.973644)^{2}} = 8.333$$

It should be noted in this last example that, since the coefficients are
pure numbers, it does not matter whether we work in units of three years
or of one year.
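The four coefficients, and their independence of the unit of measurement, can be illustrated as follows. This is a sketch only; the function name is hypothetical and the figures are the corrected moments of Examples 7.3 and 7.4.

```python
from math import sqrt


def beta_gamma(mu2, mu3, mu4):
    """Return beta1, beta2, gamma1, gamma2 as defined in (7.10) to (7.13)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return beta1, beta2, sqrt(beta1), beta2 - 3


heights = beta_gamma(6.533472, -0.207839, 134.409950)
marriages = beta_gamma(6.973644, 36.151595, 405.238888)
# The same marriage moments expressed in units of one year instead of three.
marriages_in_years = beta_gamma(6.973644 * 9, 36.151595 * 27, 405.238888 * 81)
print(heights)
print(marriages)
print(marriages_in_years)
```

The last two lines printed should agree to within rounding, since the powers of 3 cancel in each quotient.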


Measures of skewness 

7.9 The departure of a frequency-distribution from symmetry has a 
certain interest, and several measures have been devised to permit of the 
measurement of this skewness. Such measures should (a) be pure numbers,
so as to be independent of the units in which the variable is measured,
and (b) be zero when the distribution is symmetrical.

7.10 Three such measures deserve mention. In the first place, we can
define

$$\text{Skewness} = \frac{Q_1 + Q_3 - 2Mi}{2Q} \qquad (7.14)$$

This can be put in the form:

$$\text{Skewness} = \frac{(Q_3 - Mi) - (Mi - Q_1)}{(Q_3 - Mi) + (Mi - Q_1)} \qquad (7.15)$$

i.e. the skewness is taken to be the difference of the quartile deviations from
the median divided by their sum. It is clearly a pure number, for both
numerator and denominator have the same dimensions, and it is zero when
the distribution is symmetrical. It varies from −1 to +1.¹

This is a rather rough-and-ready measure which might, however, be 
useful if we were using the semi-interquartile range as a measure of dis- 
persion and were unable or unwilling to calculate the standard deviation. 

7.11 The most common measure of skewness is Pearson's, defined by

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{M - Mo}{\sigma} \qquad (7.16)$$

This evidently is a pure number and is zero for symmetrical distributions.

7.12 The calculation of this coefficient of skewness is subject to the
inconvenience of determining the position of the mode. We may circumvent
this difficulty in several ways. In the first place, for distributions
which are obviously not too skew we may use the empirical relation
of 5.27. We then have:

$$\text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \qquad (7.17)$$

Secondly, for a large class of curves to which the moderately skew
humped curve is a close approximation, the skewness of equation (7.16)
is given exactly by

$$\text{Skewness} = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \qquad (7.18)$$

We may, therefore, take this to be an approximation to the value given by
equation (7.16).

It should be noted that the measures (7.14) and (7.16) are positive if
the longer tail of the distribution lies toward the higher values of the
variate (the right) and negative in the contrary case. This accords with
the anticipatory remarks of 4.20. The measure (7.18) is to be regarded
as without sign.


¹ In the 10th and previous editions of this book the measure Skewness $= \dfrac{Q_1 + Q_3 - 2Mi}{Q}$
was suggested, i.e. twice the measure (7.14). The above form has the advantage that
its limits are −1 and +1.

Limits of the measures of skewness

7.13 We have already remarked that the measure given by equation
(7.14) lies between −1 and +1. There is no limit in theory to the measure
(7.16) or its approximation (7.18), and this is a slight drawback. But
in practice the value given by equation (7.16) is rarely very high, and for
moderately skew single-humped curves is usually less than unity.

It has been shown that the quantity

$$\frac{\text{Mean} - \text{Median}}{\text{Standard deviation}}$$

lies between the limits −1 and +1, and the measure (7.17) therefore lies
between −3 and +3. In practice it rarely approaches these limits.

Example 7.7. Let us once again consider the height distribution of
Table 4.7, which has been already discussed in this chapter (Examples 7.1,
7.3 and 7.5).

We have:

Mean (Example 5.1, p. 106)                    = 67.46 inches
S.d. (corrected, Example 6.4, p. 134)         =  2.56 inches
Median (Example 5.3, p. 112)                  = 67.47 inches
$Q_1$ (Example 6.9, p. 141)                   = 65.71 inches
$Q_3$ (ibid.)                                 = 69.21 inches
$Q$ (ibid.)                                   =  1.75 inches
$\beta_1$ (corrected, Example 7.5, p. 160)    =  0.000155
$\beta_2$                                     =  3.149

The measure of skewness (7.14) is, then,

$$\frac{Q_1 + Q_3 - 2Mi}{2Q} = \frac{65.71 + 69.21 - (2 \times 67.47)}{2 \times 1.75} = -0.006$$

We can clearly place no reliance on this figure. The median and
quartiles were obtained by methods of approximation which we cannot
expect to give accuracy to the second decimal place. We can only
conclude, therefore, that so far as the measure (7.14) is concerned, there
is no significant skewness.

The measure (7.18) gives:

$$\text{Sk} = \frac{0.0124 \times 6.149}{2(15.745 - 0.001 - 9)} = \frac{0.0124 \times 6.149}{2 \times 6.744} = 0.006$$

Here again the skewness is extremely small, and is, in fact, almost
equal to the value given by (7.14).

If we take the measure (7.17) we get:

$$\frac{3(M - Mi)}{\sigma} = \frac{-0.03}{2.56} = -0.012$$

This value is suspect because we have determined the mean and the
median only to the second decimal place, but clearly the value is small.

We conclude that there is only very slight skewness. At this stage we
cannot say whether such small skewness is significant, but it is at least
probably attributable to sampling fluctuations.
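The three measures used in this example may be set side by side in a short sketch. The function names are hypothetical; the figures are those collected at the head of Example 7.7, and $2Q$ in (7.14) is taken as $Q_3 - Q_1$.

```python
from math import sqrt


def quartile_skewness(q1, median, q3):
    """Equation (7.14): (Q1 + Q3 - 2 Mi) / (2Q), with 2Q = Q3 - Q1."""
    return (q1 + q3 - 2 * median) / (q3 - q1)


def skewness_from_median(mean, median, sd):
    """Equation (7.17): 3 (Mean - Median) / standard deviation."""
    return 3 * (mean - median) / sd


def skewness_from_betas(beta1, beta2):
    """Equation (7.18), to be regarded as without sign."""
    return sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


# Height distribution of Table 4.7, figures as quoted in Example 7.7.
print(round(quartile_skewness(65.71, 67.47, 69.21), 3))
print(round(skewness_from_median(67.46, 67.47, 2.56), 3))
print(round(skewness_from_betas(0.000155, 3.149), 3))
```

Since each input is known only to two decimal places, the third decimal of the results carries little weight, which is the point made in the example.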

Example 7.8. For the marriage data of Examples 7.2, 7.4 and 7.6
it will be found that, using the working mean as origin:

Mean    =  0.2944
Median  = −0.4018
$Q_1$   = −1.4568
$Q_3$   =  1.2316

and

$\sigma$ (corrected) (Ex. 6.5) = 2.6408
$\beta_1$ = 3.854
$\beta_2$ = 8.333

The measure (7.14) is:

$$\frac{(Q_3 - Mi) - (Mi - Q_1)}{(Q_3 - Mi) + (Mi - Q_1)} = \frac{1.6334 - 1.0550}{1.6334 + 1.0550} = \frac{0.5784}{2.6884} = 0.22$$

The measure (7.18) is:

$$\text{Sk} = \frac{\sqrt{3.854}\,(11.333)}{2(41.665 - 23.124 - 9)} = \frac{1.963 \times 11.333}{2 \times 9.541} = 1.17$$

The two are very different, as we might expect, but both indicate
strong positive skewness. As a matter of interest we may compare the
value (7.17), which gives

$$\frac{3 \times 0.6962}{2.6408} = 0.79$$

Kurtosis

7.14 The coefficient $\beta_2$ or its derivative $\gamma_2$ is used to measure a property
of the single-humped distribution known as kurtosis (κυρτός, humped).

We take as the standard value of $\beta_2$ the number 3, for reasons which
will appear when we study the so-called "normal" curve (8.24). This
curve is approximately of the shape given in fig. 4.5, page 81. Curves
with values of $\beta_2$ less than 3 are called platykurtic (πλατύς, broad, +
κυρτός). Curves with values greater than 3 are called leptokurtic¹
(λεπτός, narrow, + κυρτός). "Student" gives an amusing mnemonic for
these names: Platykurtic curves, like the platypus, are squat with short
tails. Leptokurtic curves are high with long tails like the kangaroo,
noted for "lepping"!

Example 7.9. In the height distribution of Examples 7.1, 7.3, 7.5

$$\beta_2 = 3.149$$

$$\gamma_2 = \beta_2 - 3 = 0.149$$

Hence the curve is slightly leptokurtic.

On the other hand, in the marriage distribution of Examples 7.2, 7.4,
7.6 and 7.8

$$\beta_2 = 8.333$$

$$\gamma_2 = 5.333$$

and the curve is very leptokurtic.


Cumulants 

7.15 We may conclude this chapter by referring briefly to a set of
quantities similar to moments which have some theoretical and practical
importance. These are the cumulants.²

The cumulants are defined by a rather complicated mathematical
expression which we shall not here reproduce. For present purposes it
is sufficient to note that the first four cumulants may be expressed as
simple functions of the first four moments. In fact we have:

$$\begin{aligned}
\kappa_1 &= \mu_1' \\
\kappa_2 &= \mu_2' - \mu_1'^{2} \\
\kappa_3 &= \mu_3' - 3\mu_1'\mu_2' + 2\mu_1'^{3} \\
\kappa_4 &= \mu_4' - 4\mu_1'\mu_3' - 3\mu_2'^{2} + 12\mu_1'^{2}\mu_2' - 6\mu_1'^{4}
\end{aligned} \qquad (7.19)$$

In particular, about the mean,

$$\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= \mu_2 \\
\kappa_3 &= \mu_3 \\
\kappa_4 &= \mu_4 - 3\mu_2^{2}
\end{aligned} \qquad (7.20)$$

¹ These terms are due to Karl Pearson and appear to have been given for the first
time in Biometrika, 1905, 4, 169. By a slip leptokurtosis is there inadvertently applied
to distributions for which $\beta_2 < 3$.

It has often been stated that platykurtic curves are relatively more flat-topped and
leptokurtic curves more peaked than the "normal" curve. This is the origin of the
name and of "Student's" mnemonic, and the assertion was made in the 13th and earlier
editions of this book. It is, however, very difficult to justify in general.

² These quantities were introduced into statistics by T. N. Thiele under the name of
semi-invariants, the forms "seminvariant" and "half-invariant" also occurring in
earlier literature. The word "cumulant" is preferable and is now in general use,
there being other families of quantities which also have the seminvariant property in
the algebraical sense.


7.16 These relations are used in the calculation of the cumulants, the
moments being first ascertained in the manner of the earlier sections of
this chapter. For instance, the first four cumulants of the height
distribution which has served us as an example are, about the mean,

$$\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= 6.616805 \\
\kappa_3 &= -0.207839 \\
\kappa_4 &= 137.689185 - 3 \times (6.616805)^{2} = 6.34286
\end{aligned}$$

if we take uncorrected values of the moments.


7.17 The cumulants have several remarkable properties. In the first
place, all cumulants except the first are independent of the origin of
calculation. The moments vary according to the point about which
they are calculated, which makes it necessary to specify the origin $A$
in speaking of them. The cumulants, on the other hand, do not, so that
it is unnecessary to specify any value $A$ in giving their values; the sole
exception to this rule is the first cumulant, which is the same as the
first moment.

Secondly, if the scale of measurement of the variate is altered by
multiplying all values by a constant $a$, the $n$th cumulant is multiplied
by $a^{n}$. Thus, in the height distribution, if we change our scale to
centimetres instead of inches, and so multiply all values of the variate by 2.54,
the cumulants in the previous section are to be multiplied by 2.54, $2.54^{2}$,
$2.54^{3}$, $2.54^{4}$, respectively.

We shall also see in the next chapter that the cumulants take simple 
values for certain theoretical frequency-distributions of importance. 
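Both properties may be tried out numerically. The sketch below assumes the standard expressions of the first four cumulants in terms of moments about an arbitrary origin; the function names and the handful of observations are hypothetical.

```python
def raw_moments(data, origin, max_order=4):
    """Moments of orders 0..max_order about an arbitrary origin."""
    n = len(data)
    return [sum((x - origin) ** k for x in data) / n for k in range(max_order + 1)]


def cumulants_from_raw(m):
    """First four cumulants from moments m[1..4] about any origin."""
    k1 = m[1]
    k2 = m[2] - m[1] ** 2
    k3 = m[3] - 3 * m[1] * m[2] + 2 * m[1] ** 3
    k4 = (m[4] - 4 * m[1] * m[3] - 3 * m[2] ** 2
          + 12 * m[1] ** 2 * m[2] - 6 * m[1] ** 4)
    return [k1, k2, k3, k4]


data = [2, 3, 3, 5, 8, 13]  # hypothetical observations
about_zero = cumulants_from_raw(raw_moments(data, 0))
about_five = cumulants_from_raw(raw_moments(data, 5))
in_new_scale = cumulants_from_raw(raw_moments([2.54 * x for x in data], 0))
print(about_zero)
print(about_five)     # only the first entry should differ, by 5
print(in_new_scale)   # entries multiplied by 2.54, 2.54**2, 2.54**3, 2.54**4
```

With the moments about the mean supplied (first moment zero), the same function reduces to the simpler forms used for the height distribution above.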


SUMMARY

1. The $n$th moment about the point $A$ is defined as

$$\mu_n' = \frac{1}{N}\sum(f\xi^{n})$$

where $\xi = X - A$, and $X$ is the value of the variate.

2. The $n$th moment about the mean is written $\mu_n$.

3. $\mu_n = \mu_n' - {}^{n}C_{1}\,d\,\mu_{n-1}' + {}^{n}C_{2}\,d^{2}\mu_{n-2}' - \ldots + (-1)^{n}d^{n}$

where

$$d = M - A$$

and in particular

$$\mu_3 = \mu_3' - 3d\mu_2' + 2d^{3}$$

$$\mu_4 = \mu_4' - 4d\mu_3' + 6d^{2}\mu_2' - 3d^{4}$$

4. Sheppard's corrections for the moments are:

$$\begin{aligned}
\mu_2\ (\text{corrected}) &= \mu_2 - \frac{h^{2}}{12} \\
\mu_3\ (\text{corrected}) &= \mu_3 \\
\mu_4\ (\text{corrected}) &= \mu_4 - \tfrac{1}{2}h^{2}\mu_2 + \tfrac{7}{240}h^{4}
\end{aligned}$$

5.

$$\beta_1 = \frac{\mu_3^{2}}{\mu_2^{3}} \qquad \beta_2 = \frac{\mu_4}{\mu_2^{2}}$$

$$\gamma_1 = \sqrt{\beta_1} \qquad \gamma_2 = \beta_2 - 3 = \frac{\mu_4 - 3\mu_2^{2}}{\mu_2^{2}}$$

6. Pearson's measure of skewness is given by

$$\text{Sk} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

which, for a large class of curves, is equal to

$$\frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

7. If the standard deviation is not known, a rough measure of skewness
is obtained by taking

$$\text{Sk} = \frac{Q_1 + Q_3 - 2Mi}{2Q}$$

8. Distributions for which $\beta_2 > 3$ are said to be leptokurtic; those for
which $\beta_2 < 3$ are platykurtic.

9. The first four cumulants, in terms of the moments about the mean,
are:

$$\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= \mu_2 \\
\kappa_3 &= \mu_3 \\
\kappa_4 &= \mu_4 - 3\mu_2^{2}
\end{aligned}$$

10. The cumulants are independent of the origin of calculation, except
the first, which is equal to the mean.

EXERCISES

7.1 Find the first four moments about the mean of the distribution of
males in the United Kingdom according to weight given in Exercise 4.6,
page 100. (Correct your values for grouping.)

Hence find $\beta_1$ and $\beta_2$ and measure the kurtosis of the distribution.

7.2 For the same distribution find the three measures of skewness,
approximating to the mode by the empirical relation of 5.27.

7.3 Find the first four moments about the mean, the values of $\beta_1$, $\beta_2$
and the three measures of skewness for the following distribution (see
table below). (Apply Sheppard's corrections.)

7.4 In the data of Example 7.1, group the individuals by intervals of
three inches (57-, 60-, etc.) and calculate the first four moments about
the mean. Compare your results with those of Example 7.1, (a) before
Sheppard's corrections are applied, and (b) after Sheppard's corrections
are applied.

7.5 Find the third and fourth moments about the mean of the binomial
series

$$q^{n},\quad nq^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^{2},\ \ldots \qquad \text{where } p + q = 1$$

(continuing the work of Exercise 6.10, page 149).


Data for Exercise 7.3: 4,912 cows classified according to their yield of milk

(Data from J. F. Tocher, "An Investigation of the Milk Yield of Dairy Cows,"
Biometrika, 1928, 20B, 105.)

Yield of milk                       Yield of milk
(gallons per week)     Number of    (gallons per week)     Number of
(central value of      cows         (central value of      cows
interval)                           interval)

        8                   1               23                214
        9                   5               24                153
       10                  13               25                112
       11                  33               26                 58
       12                  71               27                 35
       13                 151               28                 13
       14                 236               29                 15
       15                 339               30                  4
       16                 499               31                  5
       17                 552               32                  2
       18                 585               33                  1
       19                                   34                  1
       20                 496
       21                 448             Total             4,912
       22                 284

7.6 The first four moments of a distribution about the value 4 are -1.5,
17, -30 and 108; find the moments about the mean and the origin.

7.7 Show that for a symmetrical distribution all moments about the mean 
of odd order are zero. 

7.8 Show that for any distribution $\beta_2 > 1$.

7.9 Calculate the second, third and fourth cumulants of the distribution
of Australian marriages of Example 7.2, (a) from the moments about the
mean, using equation (7.20), and (b) from the moments about the value
28.5, using equation (7.19); and hence verify that the values of the
cumulants are independent of the origin of calculation. (Use uncorrected
values of the moments.)

7.10 Show that

$$d = \kappa_1$$
$$\sigma = \sqrt{\kappa_2}$$




CHAPTER EIGHT


THREE IMPORTANT THEORETICAL
DISTRIBUTIONS

THE BINOMIAL, THE NORMAL AND THE POISSON

Theoretical distributions

8.1 In the examples of frequency-distributions which we have given 
in Chapter 4 and subsequent chapters we have been careful to take data 
from observation and experiment. It is possible, however, starting with 
certain general hypotheses, to deduce mathematically what the frequency- 
distributions of certain populations should be. Such distributions we 
shall call theoretical. 

8.2 There are three theoretical distributions which, from their historical
interest as well as their intrinsic importance, occupy a position in the
forefront of statistical theory. They are, in the order of their discovery,
the Binomial (due to James Bernoulli, circa 1700), the Normal (due to
Demoivre, but more often associated with the names of Laplace and
Gauss, who discussed it at the close of the eighteenth and the beginning
of the nineteenth centuries), and the Poisson (due to S. D. Poisson, who
published it in 1837).

These three are, so to speak, the classical distributions. Certain others
were discovered during the nineteenth century, but it was not until the
end of the century that there began the second period of statistical
discovery which has since given us a wealth of theoretical distributions.
Even this latest crop depends to some extent on the properties of the
first three, and particularly of the Normal Distribution. The three
therefore form, historically and logically, the starting-point of the theory
of particular distributions, and in this chapter we propose to give an
account of their main properties.

The binomial distribution

8.3 If we may regard an ideal coin as a uniform, homogeneous circular
disc, there is nothing which can make it tend to fall more often on the
one side than on the other; we may expect, therefore, that in any long
series of throws the coin will fall with either face uppermost an
approximately equal number of times, or with, say, heads uppermost
approximately half the times. Similarly, if we may regard the ideal die as
a perfect homogeneous cube, it will tend, in any long series of throws, to
fall with each of its six faces uppermost an approximately equal number of
times, or with any given face uppermost one-sixth of the whole number
of times. These results are sometimes expressed by saying that the chance
of throwing heads (or tails) with a coin is 1/2, and the chance of throwing
six (or any other face) with a die is 1/6. To avoid speaking of such
particular instances as coins or dice we shall in future, using terms which
have become conventional, refer to an event the chance of success of
which is p and the chance of failure q. Obviously p+q=1.

8.4 We will now assume that the events in a number of trials are all
independent, i.e. that the chances p and q are the same for each event
and remain constant throughout the trials. The case corresponds to the
tossing of perfect coins or the throwing of perfect dice.

Suppose now we take a number of sets of n trials and count the number
of successes in each set; for example, we might toss a coin ten times for
each set, and observe the number of heads in each set of ten. In general,
there will be some sets with no successes, some with one success, some with
two successes, and so on. Hence, if we classify the sets according to the
number of successes which they contain we shall get a
frequency-distribution. Table 4.15, page 96, gives such a distribution for
some dice-throwing experiments. We shall now see how, on the assumption of
independence of successive events to which we have just referred, the
nature of this distribution may be theoretically determined.

8.5 For the case of single events we expect in N trials to get Np successes
and Nq failures.

Suppose now we take N pairs of events, i.e. two to the set. There will
be Nq cases in which the first event is a failure, and, in virtue of the
independence of the events, among these Nq there will be $Nq \times q$
failures, and $Nq \times p$ successes, of the second event on the average.
Similarly, of the Np cases in which the first event was a success, the
second event will, on the average, be a success in $Np \times p$ and a
failure in $Np \times q$ cases. Hence there will be $Nq^2$ cases in which
both events are failures, $2Npq$ cases with one success and one failure,
and $Np^2$ cases in which both are successes.

If we now take N sets of three events, we see that, of the $Nq^2$ cases in
which the first two events were failures, $Nq^2 \times q$ will give a third
failure and $Nq^2 \times p$ one success; of the $2Npq$ cases, $2Npq^2$ will
give two failures and a success and $2Np^2q$ one failure and two successes;
and of the $Np^2$ cases, $Np^2q$ will give one failure and two successes and
$Np^3$ will give three successes. Hence the number of sets with 3 failures,
2 failures and 1 success, 1 failure and 2 successes, and 3 successes are,
respectively,

$$Nq^3,\quad 3Nq^2p,\quad 3Nqp^2,\quad Np^3$$

8.6 From these results it is evident that the frequencies of 0, 1, 2, . . .
successes are given

for one event by the binomial expansion of $N(q+p)$
for two events by the binomial expansion of $N(q+p)^2$
for three events by the binomial expansion of $N(q+p)^3$

In general, for n events the frequencies of successes in N sets are given
by the successive terms in the binomial expansion of $N(q+p)^n$, i.e.

$$N\left\{q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2 + \dots\right\}$$

This is the so-called binomial distribution.
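The step-by-step argument of 8.5 and 8.6 can be followed numerically. The sketch below, with hypothetical function names, builds the frequencies one event at a time exactly as in the text (each existing class splits into a part multiplied by q and a part multiplied by p) and compares the result with the terms of $N(q+p)^n$.

```python
from math import comb


def frequencies_by_induction(n, p, N=1.0):
    """Add one event at a time, splitting every class in the proportions q : p."""
    q = 1 - p
    freq = [N]  # with no events, all N sets contain 0 successes
    for _ in range(n):
        nxt = [0.0] * (len(freq) + 1)
        for successes, f in enumerate(freq):
            nxt[successes] += f * q      # the new event is a failure
            nxt[successes + 1] += f * p  # the new event is a success
        freq = nxt
    return freq


def binomial_terms(n, p, N=1.0):
    """Successive terms of N(q+p)^n, indexed by the number of successes."""
    q = 1 - p
    return [N * comb(n, s) * q ** (n - s) * p ** s for s in range(n + 1)]


if __name__ == "__main__":
    print(frequencies_by_induction(3, 0.3, N=1000))
    print(binomial_terms(3, 0.3, N=1000))
```

For n = 3 the two lists should agree with $Nq^3$, $3Nq^2p$, $3Nqp^2$, $Np^3$ up to rounding error.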

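Example 8.1. If we take 100 sets of 10 tosses of a perfect coin, in
how many cases should we expect to get 7 heads and 3 tails?

Here $p = q = \tfrac{1}{2}$.

Hence, the numbers of successes 0, 1, . . . 10 are the terms in
$100(\tfrac{1}{2}+\tfrac{1}{2})^{10}$.

The term giving 7 successes and 3 failures is

$$100 \times {}^{10}C_7 \left(\tfrac{1}{2}\right)^7\left(\tfrac{1}{2}\right)^3
= 100 \times \frac{10\cdot 9\cdot 8}{1\cdot 2\cdot 3} \times \frac{1}{2^{10}}
= 12 \text{ approximately.}$$

Example 8.2. In the previous example, in how many cases should
we expect to get 7 heads at least? As before, the numbers of successes
are the terms in

$$\frac{100}{2^{10}}\left\{1 + {}^{10}C_1 + {}^{10}C_2 + \dots + 1\right\}$$

We require the sum of terms with 7, 8, 9, 10 successes. Our expected
number is, then,

$$\frac{100}{2^{10}}\left\{{}^{10}C_7 + {}^{10}C_8 + {}^{10}C_9 + {}^{10}C_{10}\right\}
= \frac{100}{2^{10}}\left\{\frac{10\cdot 9\cdot 8}{1\cdot 2\cdot 3} + \frac{10\cdot 9}{1\cdot 2} + \frac{10}{1} + 1\right\}
= \frac{100}{2^{10}}\{176\}
= 17 \text{ approximately.}$$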
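The arithmetic of Examples 8.1 and 8.2 can be checked with the standard library. The helper name `expected_sets` is an assumption made for this sketch.

```python
from math import comb


def expected_sets(N, n, p, successes):
    """Expected number of sets, out of N, showing the given number of successes."""
    return N * comb(n, successes) * p ** successes * (1 - p) ** (n - successes)


# Example 8.1: exactly 7 heads in 10 tosses, 100 sets.
exactly_7 = expected_sets(100, 10, 0.5, 7)

# Example 8.2: 7 heads at least, i.e. the terms for 7, 8, 9 and 10 successes.
at_least_7 = sum(expected_sets(100, 10, 0.5, s) for s in range(7, 11))

print(round(exactly_7), round(at_least_7))  # 12 17
```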
General form of the binomial distribution

8.7 The form of the binomial distribution depends (1) on the values
of p and q, (2) on the value of the exponent n.

If p and q are equal the distribution is evidently symmetrical, for p
and q may be interchanged without altering the value of any term, and
consequently terms equidistant from the two ends of the series are equal.

If, on the other hand, p and q are unequal, the distribution is skew.
The following table shows the calculated distributions for n=20 and
values of p, proceeding by 0.1, from 0.1 to 0.5. When p=0.1, cases of
two successes are the most frequent, but cases of one success are almost
equally frequent: even nine successes may, however, occur about once
in 10,000 trials. As p is increased, the position of the maximum frequency
gradually advances, and the two tails of the distribution become more
nearly equal, until p=0.5, when the distribution is symmetrical. Of
course, if the table were continued, the distribution for p=0.6 would be
similar to that for p=0.4, but reversed end for end, and so on.


TABLE 8.1. Terms of the binomial series $10{,}000\,(q+p)^{20}$ for values of p from 0.1 to 0.5

(Figures given to the nearest unit)

Number of     p=0.1     p=0.2     p=0.3     p=0.4     p=0.5
successes     q=0.9     q=0.8     q=0.7     q=0.6     q=0.5

    0         1,216       115         8         -         -
    1         2,702       576        68         5         -
    2         2,852     1,369       278        31         2
    3         1,901     2,054       716       123        11
    4           898     2,182     1,304       350        46
    5           319     1,746     1,789       746       148
    6            89     1,091     1,916     1,244       370
    7            20       545     1,643     1,659       739
    8             4       222     1,144     1,797     1,201
    9             1        74       654     1,597     1,602
   10             -        20       308     1,171     1,762
   11             -         5       120       710     1,602
   12             -         1        39       355     1,201
   13             -         -        10       146       739
   14             -         -         2        49       370
   15             -         -         -        13       148
   16             -         -         -         3        46
   17             -         -         -         -        11
   18             -         -         -         -         2
   19             -         -         -         -         -
   20             -         -         -         -         -
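A column of Table 8.1 can be regenerated directly from the binomial terms. This is a minimal sketch; `table_column` is a hypothetical helper, and rounding to the nearest unit follows the note under the table title.

```python
from math import comb


def table_column(n, p, scale=10_000):
    """Terms of scale*(q+p)^n rounded to the nearest unit, for 0..n successes."""
    q = 1 - p
    return [round(scale * comb(n, s) * q ** (n - s) * p ** s) for s in range(n + 1)]


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(p, table_column(20, p))
```

Reversing the list for p = 0.4 gives the column for p = 0.6, which is the "reversed end for end" remark of 8.7.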


8.8 If p=q, the effect of increasing n is to raise the mean and increase
the dispersion. If p is not equal to q, however, not only does an increase
in n raise the mean and increase the dispersion, but it also lessens the
asymmetry; the greater n, for the same values of p and q, the less the
asymmetry. Thus, if we compare the first distribution of the above table
with that given by n=100, we have the following:

TABLE 8.2. Terms of the binomial series $10{,}000\,(0.9+0.1)^{100}$

(Figures given to the nearest unit)

[Fig. 8.1. The binomial $(0.9+0.1)^n$ for various values of n. Horizontal axis: number of successes.]

The maximum frequencies now occur for 9 and 10 successes, and the two
"tails" are much more nearly equal. If, on the other hand, n is reduced
to 2, the distribution is

Number of
successes      Frequency

    0            8,100
    1            1,800
    2              100

and the maximum frequency is at one end of the range.

The tendency towards symmetry may be seen from fig. 8.1, in which
the binomial $(0.9+0.1)^n$ has been drawn for various values of n. See
also 8.12 below.


Constants of the binomial distribution

8.9 We proceed to find the lower moments of the distribution $N(q+p)^n$.

Taking an arbitrary origin at 0 successes, we have the successive
deviations $\xi$ as 0, 1, 2, . . . n, and hence,

$$\mu'_1 = (q^n \times 0) + ({}^{n}C_1 q^{n-1}p \times 1) + ({}^{n}C_2 q^{n-2}p^2 \times 2) + \dots + (p^n \times n)$$
$$= np\{q^{n-1} + (n-1)q^{n-2}p + \dots + p^{n-1}\}$$
$$= np(q+p)^{n-1}$$

Now, $q+p=1$.

Hence, $\mu'_1 = np$.

That is, the mean M is np.

We have, further,

$$\mu'_2 = (q^n \times 0) + ({}^{n}C_1 q^{n-1}p \times 1^2) + ({}^{n}C_2 q^{n-2}p^2 \times 2^2) + \dots + (p^n \times n^2)$$
$$= np\left\{q^{n-1} + 2(n-1)q^{n-2}p + 3\,\frac{(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2 + \dots + np^{n-1}\right\}$$

The expression in brackets is the first moment of the binomial $(q+p)^{n-1}$
about origin -1, and hence is equal to $(n-1)p+1$.

Hence,

$$\mu'_2 = np\{(n-1)p+1\}$$

It may also be shown in a similar way (but we omit the proof) that

$$\mu'_3 = np\{(n-1)(n-2)p^2 + 3(n-1)p + 1\}$$

8.10 From these results we may find the moments about the mean.
We have:

$$\mu_2 = \mu'_2 - \mu'^{\,2}_1 = np\{(n-1)p + 1\} - n^2p^2$$
$$= np(1-p)$$
$$= npq$$

Hence we have the important result that

$$\sigma = \sqrt{npq} \qquad (8.1)$$

8.11 Similarly, it will be found that

$$\mu_3 = npq(q-p) \qquad (8.2)$$

$$\mu_4 = 3p^2q^2n^2 + pqn(1-6pq) \qquad (8.3)$$

Hence,

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(q-p)^2}{npq} \qquad (8.4)$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1-6pq}{npq} \qquad (8.5)$$
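Results (8.1) to (8.3) can be checked against a direct summation over the terms of the series. The sketch below takes unit total frequency (N = 1); the function names are illustrative only.

```python
from math import comb


def moments_by_summation(n, p):
    """Mean and mu_2, mu_3, mu_4 of (q+p)^n obtained by summing over every term."""
    q = 1 - p
    weights = [comb(n, s) * q ** (n - s) * p ** s for s in range(n + 1)]
    mean = sum(s * w for s, w in enumerate(weights))

    def mu(k):
        return sum((s - mean) ** k * w for s, w in enumerate(weights))

    return mean, mu(2), mu(3), mu(4)


def moments_by_formula(n, p):
    """Mean np together with equations (8.1)-(8.3), where mu_2 = sigma^2 = npq."""
    q = 1 - p
    return (n * p,
            n * p * q,
            n * p * q * (q - p),
            3 * p ** 2 * q ** 2 * n ** 2 + p * q * n * (1 - 6 * p * q))


if __name__ == "__main__":
    print(moments_by_summation(20, 0.1))
    print(moments_by_formula(20, 0.1))
```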


8.12 Thus the binomial distribution has mean np and standard deviation
$\sqrt{npq}$. It is instructive to note that $\beta_1$ and $(\beta_2-3)$ are both of
order $\frac{1}{n}$. Hence, as n becomes larger, the distribution tends to symmetry
and zero kurtosis.

The values of $\beta_1$ and $\beta_2$ for some values of p and q and ranges of n are
shown in Tables 8.3, 8.4 and 8.5.

From an inspection of these tables it will be seen that even for an
extremely small value of p the binomial tends to zero skewness and zero
kurtosis for values of n well within practical limits. For the symmetrical
binomial p=q=0.5, $\beta_1$ is of course zero, and $\beta_2$ rapidly approaches 3.


TABLE 8.3. Values of $\beta_1$ and $\beta_2$ for the binomial with p=0.02, q=0.98
(From M. Greenwood, Biometrika, 1913, 9, 69.)

    n         beta_1      beta_2

    100       0.4702      3.4502
    200       0.2351      3.2251
    300       0.1567      3.1501
    400       0.1176      3.1126
    500       0.0940      3.0900
    600       0.0784      3.0750
    700       0.0672      3.0643
    800       0.0588      3.0563
    900       0.0522      3.0500
  1,000       0.0470      3.0450

TABLE 8.4. Values of $\beta_1$ and $\beta_2$ for the binomial with p=0.1, q=0.9

    n         beta_1      beta_2

    100       0.0711      3.0511
    200       0.0356      3.0256
  1,000       0.0071      3.0051

TABLE 8.5. Values of $\beta_2$ for the binomial with p=q=0.5

    n         beta_2

      4       2.5
      6       2.6667
      8       2.75
     10       2.8
     50       2.96
    100       2.98
  1,000       2.998
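The entries of Tables 8.3 to 8.5 follow from equations (8.4) and (8.5). The sketch below, with hypothetical helper names, prints a few of them so that the order-1/n behaviour described in 8.12 can be seen directly.

```python
def beta1(n, p):
    """Equation (8.4): (q - p)^2 / (npq)."""
    q = 1 - p
    return (q - p) ** 2 / (n * p * q)


def beta2(n, p):
    """Equation (8.5): 3 + (1 - 6pq) / (npq)."""
    q = 1 - p
    return 3 + (1 - 6 * p * q) / (n * p * q)


if __name__ == "__main__":
    for p, sizes in ((0.02, (100, 500, 1000)), (0.1, (100, 200, 1000)), (0.5, (4, 10, 1000))):
        for n in sizes:
            print(f"p={p}  n={n:5d}  beta1={beta1(n, p):.4f}  beta2={beta2(n, p):.4f}")
```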


Mechanical representation of the binomial distribution 
8.13 There is an interesting mechanical method of constructing a repre- 
sentation of the binomial series. The apparatus, which is illustrated 
in fig. 8.2, consists of a funnel opening into a space — say a J inch in depth 
— between a sheet of glass and a back-board. This space is broken up by 

successive rows of wedges like 1, 2 3, 4 5 6, etc., which will divide up
into streams any granular material such as shot or mustard seed which is
poured through the funnel when the apparatus is held at a slope. At the
foot these wedges are replaced by vertical strips, in the spaces between
which the material can collect. Consider the stream of material that comes
from the funnel and meets the wedge 1. This wedge is set so as to throw
q parts of the stream to the left and p parts to the right (of the
observer). The wedges 2 and 3 are set so as to divide the resultant
streams in the same proportions. Thus wedge 2 throws $q^2$ parts of the
original material to the left and $qp$ to the right, wedge 3 throws $pq$
parts of the original material to the left and $p^2$ to the right. The
streams passing these wedges are therefore in the ratio of
$q^2 : 2qp : p^2$. The next row of wedges is again set so as to divide
these streams in the same proportions as before and the four streams that
result will bear the proportions $q^3 : 3q^2p : 3qp^2 : p^3$. The final set,
at the heads of the vertical strips, will give the streams proportions
$q^4 : 4q^3p : 6q^2p^2 : 4qp^3 : p^4$, and these streams will accumulate
between the strips and give a representation of the binomial by a kind of
histogram, as shown. Of course as many rows of wedges may be provided as
may be desired.

[Fig. 8.2. The binomial apparatus]
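The apparatus can be imitated with a small random simulation. This is a sketch under simplifying assumptions, not a model of the physical device: each pellet is taken to be thrown right with probability p, independently, at every row of wedges, and the function names and seed are arbitrary choices.

```python
import random
from math import comb


def drop_shot(pellets, rows, p, seed=0):
    """Count how many pellets end in each of the rows+1 compartments."""
    rng = random.Random(seed)
    compartments = [0] * (rows + 1)
    for _ in range(pellets):
        # Compartment index = number of rows at which the pellet went right.
        rights = sum(1 for _row in range(rows) if rng.random() < p)
        compartments[rights] += 1
    return compartments


def ideal_proportions(rows, p):
    """Stream proportions q^rows : ... : p^rows given by the binomial series."""
    q = 1 - p
    return [comb(rows, r) * q ** (rows - r) * p ** r for r in range(rows + 1)]


if __name__ == "__main__":
    counts = drop_shot(20_000, rows=4, p=0.5)
    print([c / 20_000 for c in counts])
    print(ideal_proportions(4, 0.5))
```

With four rows the ideal proportions are $q^4 : 4q^3p : 6q^2p^2 : 4qp^3 : p^4$, and the simulated fractions should lie close to them for a large number of pellets.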

This kind of apparatus was originally devised by Galton in a form
that gave roughly the symmetrical binomial, a stream of shot being
allowed to fall through rows of nails, and the resultant streams being
collected in partitioned spaces. The apparatus was generalised by Karl
Pearson, who used rows of wedges fixed to movable slides, so that they
could be adjusted to give any ratio of q : p.

8.14 It must not be forgotten that although we have spoken in 8.12 of
the skewness and kurtosis of the binomial distribution, it is essentially
discontinuous. This is a serious limitation.

Consider, for example, the frequency-distribution of the number of male
births in batches of 10,000 births, the mean number being, say, 5,100. The
distribution will be given by the terms of the series $(0.49+0.51)^{10,000}$,
and the standard deviation is, in round numbers, 50 births. The distribution
will therefore extend to some 150 births or more on either side of the mean
number, and in order to obtain it we should have to calculate some 300
terms of a binomial series with an exponent of 10,000! This would not
only be practically impossible without the use of certain methods of
approximation, but it would give the distribution in quite unnecessary
detail: as a matter of practice, we should not have compiled a
frequency-distribution by single male births, but should certainly have
grouped our observations, taking probably 10 births as the class-interval.
We want, therefore, to replace the binomial polygon by some continuous
curve, having approximately the same ordinates, the curve being such that
the area between any two ordinates $y_1$ and $y_2$ will give the frequency of
observations between the corresponding values of the variable $x_1$ and $x_2$.
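With a modern computer the 300 or so terms of this series can be evaluated by working with logarithms of factorials. The following sketch uses Python's `math.lgamma` for that purpose and then groups the terms in class-intervals of 10 births, as the paragraph suggests; the range 4,950 to 5,250 and the variable names are choices made for this illustration.

```python
from math import exp, lgamma, log, sqrt


def log_binomial_term(n, m, p):
    """Natural logarithm of nCm * p^m * q^(n-m), avoiding enormous factorials."""
    return (lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
            + m * log(p) + (n - m) * log(1 - p))


n, p = 10_000, 0.51
sd = sqrt(n * p * (1 - p))  # close to 50 births

# Terms within about 150 births of the mean number 5,100.
lo, hi = 4_950, 5_250
terms = {m: exp(log_binomial_term(n, m, p)) for m in range(lo, hi + 1)}

# Grouped frequencies (as fractions of all batches) with a class-interval of 10.
grouped = {start: sum(terms[m] for m in range(start, min(start + 10, hi + 1)))
           for start in range(lo, hi + 1, 10)}

print(round(sd, 2), len(terms), round(sum(terms.values()), 4))
```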

Limiting form of the binomial for large n 

8.15 When n becomes large, each term of the binomial becomes small.
We are, however, concerned with the sum of the terms falling within
certain ranges, and these will not be small in general.

Let us consider first of all the case when p and q are equal. The terms
of the series are

$$N\left(\tfrac{1}{2}\right)^n\left\{1 + n + \frac{n(n-1)}{1\cdot 2} + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3} + \dots\right\}$$

The frequency of m successes is

$$N\left(\tfrac{1}{2}\right)^n\frac{n!}{m!\,(n-m)!}$$

and the frequency of m+1 successes is derived from this by multiplying
it by $(n-m)/(m+1)$. The latter frequency is therefore greater than the
former so long as

$$n-m > m+1$$

or

$$m < \frac{n-1}{2}$$

Suppose, for simplicity, that n is even, say equal to 2k; then the frequency
of k successes is the greatest, and its value is

$$y_0 = N\left(\tfrac{1}{2}\right)^{2k}\frac{(2k)!}{k!\,k!} \qquad (8.6)$$

The polygon tails off symmetrically on either side of this greatest ordinate.
Consider the frequency of k+x successes; the value is

$$y_x = N\left(\tfrac{1}{2}\right)^{2k}\frac{(2k)!}{(k+x)!\,(k-x)!} \qquad (8.7)$$

and therefore

$$\frac{y_x}{y_0} = \frac{k(k-1)(k-2)\dots(k-x+1)}{(k+1)(k+2)(k+3)\dots(k+x)}
= \frac{\left(1-\frac{1}{k}\right)\left(1-\frac{2}{k}\right)\dots\left(1-\frac{x-1}{k}\right)}{\left(1+\frac{1}{k}\right)\left(1+\frac{2}{k}\right)\dots\left(1+\frac{x}{k}\right)} \qquad (8.8)$$

Now let us approximate by assuming that k is very large, and indeed
large compared with x, so that $(x/k)^2$ may be neglected compared with
$(x/k)$. This assumption does not involve any difficulty, for we need not
consider values of x much greater than three times the standard deviation
or $3\sqrt{k/2}$, and the ratio of this to k is $3/\sqrt{2k}$, which is
necessarily small if k be large. On this assumption we may apply the
logarithmic series

$$\log_e(1+\delta) = \delta - \frac{\delta^2}{2} + \frac{\delta^3}{3} - \frac{\delta^4}{4} + \dots$$

to every bracket in the fraction (8.8), and neglect all terms beyond the
first. To this degree of approximation,

$$\log_e\frac{y_x}{y_0} = -\frac{2}{k}\left(1+2+3+\dots+(x-1)\right) - \frac{x}{k}$$
$$= -\frac{x(x-1)}{k} - \frac{x}{k}$$
$$= -\frac{x^2}{k}$$

Therefore, finally

$$y_x = y_0\,e^{-\frac{x^2}{k}} = y_0\,e^{-\frac{x^2}{2\sigma^2}} \qquad (8.9)$$

where, in the last expression, the constant k has been replaced by the
standard deviation $\sigma$, for $\sigma^2 = k/2$.
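The closeness of approximation (8.9) is easy to inspect numerically. The sketch below compares the exact ratio $y_x/y_0$ from (8.6) and (8.7) with $e^{-x^2/k}$ for a few values of k; the function names are illustrative.

```python
from math import comb, exp


def exact_ratio(k, x):
    """y_x / y_0 for the symmetrical binomial with n = 2k, from (8.6) and (8.7)."""
    return comb(2 * k, k + x) / comb(2 * k, k)


def limiting_ratio(k, x):
    """The approximation (8.9): exp(-x^2 / k)."""
    return exp(-x * x / k)


if __name__ == "__main__":
    for k in (10, 50, 500):
        sigma = (k / 2) ** 0.5
        for x in (round(sigma), round(2 * sigma), round(3 * sigma)):
            print(k, x, round(exact_ratio(k, x), 5), round(limiting_ratio(k, x), 5))
```

The agreement should improve as k grows, in line with the assumption that x/k is small.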

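8.16 The case when p is not equal to q may be treated in a somewhat
similar way but is slightly more complicated.

As before the frequency of m successes is

$$N \times {}^{n}C_m\, q^{n-m}p^m = N\,\frac{n!}{m!\,(n-m)!}\,q^{n-m}p^m$$

The frequency of (m+1) successes is derived by multiplying this
expression by $\dfrac{n-m}{m+1}\cdot\dfrac{p}{q}$, and hence is greater than
the former if

$$\frac{n-m}{m+1}\cdot\frac{p}{q} > 1$$

or

$$m < np - q$$

Let us assume that np is a whole number. Since n is going to tend
to infinity, this really imposes no limitation on our work.

The maximum frequency is, then,

$$y_0 = N\,\frac{n!}{(np)!\,(nq)!}\,p^{np}q^{nq} \qquad (8.10)$$

The frequency of pn+x successes is

$$y_x = N\,\frac{n!}{(np+x)!\,(nq-x)!}\,p^{np+x}q^{nq-x} \qquad (8.11)$$

Hence,

$$\frac{y_x}{y_0} = \frac{(np)!\,(nq)!}{(np+x)!\,(nq-x)!}\,p^{x}q^{-x} \qquad (8.12)$$

Now, by an important theorem due to James Stirling (1730), if n be large,
we have approximately

$$n! = \sqrt{2\pi n}\; n^n e^{-n}$$

Applying this formula here,

$$\frac{y_x}{y_0} = \frac{\sqrt{2\pi np}\,(np)^{np}e^{-np}\;\sqrt{2\pi nq}\,(nq)^{nq}e^{-nq}\;p^{x}q^{-x}}{\sqrt{2\pi(np+x)}\,(np+x)^{np+x}e^{-(np+x)}\;\sqrt{2\pi(nq-x)}\,(nq-x)^{nq-x}e^{-(nq-x)}}$$

which reduces to

$$\frac{y_x}{y_0} = \frac{1}{\left(1+\dfrac{x}{np}\right)^{np+x+\frac{1}{2}}\left(1-\dfrac{x}{nq}\right)^{nq-x+\frac{1}{2}}}$$

Hence,

$$\log_e\frac{y_x}{y_0} = -(np+x+\tfrac{1}{2})\log_e\left(1+\frac{x}{np}\right) - (nq-x+\tfrac{1}{2})\log_e\left(1-\frac{x}{nq}\right)$$
$$= -(np+x+\tfrac{1}{2})\left(\frac{x}{np} - \frac{x^2}{2n^2p^2} + \frac{x^3}{3n^3p^3} - \dots\right)
+ (nq-x+\tfrac{1}{2})\left(\frac{x}{nq} + \frac{x^2}{2n^2q^2} + \frac{x^3}{3n^3q^3} + \dots\right)$$

After a little rearrangement this becomes

$$\log_e\frac{y_x}{y_0} = -\frac{x^2}{2npq} + \frac{x^2(p^2+q^2)}{4n^2p^2q^2} - \frac{(q-p)x}{2npq} + \frac{(q^2-p^2)x^3}{6n^2p^2q^2}
+ \text{terms of order } \frac{1}{n^3} \text{ and higher}$$

Since q+p=1, we have, neglecting the terms of order $\frac{1}{n^3}$ and higher,
which are small compared with the others when n is large,

$$\log_e\frac{y_x}{y_0} = -\frac{x^2}{2npq} + \frac{x^2(p^2+q^2)}{4n^2p^2q^2} - \frac{(q-p)x}{2npq}\left(1 - \frac{x^2}{3npq}\right)$$

Put, as before, $npq=\sigma^2$, where $\sigma$ is the standard deviation of the
binomial. If n be large, the second term is small compared with the first.
Further, since we need not consider values of $x/\sigma$ much greater than 3,
if $\dfrac{q-p}{\sqrt{npq}}$ be small, we can neglect the whole of the third
term. On these assumptions we have

$$\log_e\frac{y_x}{y_0} = -\frac{x^2}{2\sigma^2}$$

or

$$y_x = y_0\,e^{-\frac{x^2}{2\sigma^2}} \qquad (8.14)$$

as before.

The expression $\dfrac{q-p}{\sqrt{npq}}$ is merely $\sqrt{\beta_1}$, and so we
have in effect simply assumed $\sqrt{\beta_1}$ small; however much p and q
differ we can always make $\sqrt{\beta_1}$ as small as we please by increasing
n sufficiently.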
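The effect of the neglected terms can be seen by comparing three quantities for a skew binomial: the exact ratio $y_x/y_0$, the normal approximation (8.14), and the three-term expression for $\log_e(y_x/y_0)$ obtained just before the third term is dropped. The sketch below uses n = 100 and p = 0.2 purely as an illustrative choice, and the function names are hypothetical.

```python
from math import comb, exp


def exact_ratio(n, p, x):
    """y_x / y_0, where y_0 is the term at np successes (np assumed whole)."""
    q = 1 - p
    m0 = round(n * p)

    def term(m):
        return comb(n, m) * p ** m * q ** (n - m)

    return term(m0 + x) / term(m0)


def log_ratio_terms(n, p, x):
    """The three terms of the series for log(y_x / y_0), with sigma^2 = npq."""
    q = 1 - p
    s2 = n * p * q
    first = -x ** 2 / (2 * s2)
    second = x ** 2 * (p ** 2 + q ** 2) / (4 * s2 ** 2)
    third = -(q - p) * x / (2 * s2) * (1 - x ** 2 / (3 * s2))
    return first, second, third


if __name__ == "__main__":
    n, p = 100, 0.2
    for x in (-8, -4, 4, 8):
        first, second, third = log_ratio_terms(n, p, x)
        print(x, round(exact_ratio(n, p, x), 4),
              round(exp(first), 4),                    # normal curve (8.14)
              round(exp(first + second + third), 4))   # with the skew terms kept
```

The third term changes sign with x, which is how the skewness of the binomial shows itself before n is large enough for it to be neglected.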


8.17 Hence, whether or not p is equal to q, the binomial distribution
tends to the form of the continuous curve ((8.9) and (8.14)) when n
becomes large, at least for the material part of the range. As a matter
of fact, the correspondence between the binomial and the curve is
surprisingly close even for comparatively low values of n, provided that
p and q are fairly near equality. The student may care to draw the curve
with the aid of the tables given at the end of this book (see below, 8.26)
and compare it with some of the simpler binomials drawn to the same
scale.


8.18 The curve

$$y = y_0\,e^{-\frac{x^2}{2\sigma^2}}$$

is called the normal curve. A population classified according to a
continuous variate whose ideal frequency-distribution is a normal curve is
called a normal population.

The applications of the normal curve are by no means limited to dis- 
tributions of the binomial type. Before we refer to its many practical 
and theoretical applications, however, we shall give a short account of 
its main properties. 

Properties of the normal curve 

8.19 The normal curve is obviously symmetrical about the point x=0,
for its equation is independent of the sign of x. At this point the
ordinate has its maximum value. The mean, the median and the mode
coincide, and the curve is, in fact, that drawn in fig. 4.5, page 81, and taken
as the ideal form of the symmetrical curve.

8.20 The curve is specified completely by defining the mean (the origin
of x), the standard deviation $\sigma$ and the value $y_0$.

In actual practice, as, for example, when we are trying to fit a normal
curve to given data, we are not given $y_0$ itself, but have to calculate it
from the fact that the area of the curve must be equal, on the chosen
scale, to the total number of observations. For this reason we wish to
find the area under the curve

$$y = y_0\,e^{-\frac{x^2}{2\sigma^2}}$$

8.21 From 4.14 it will be seen that the area of a histogram, that is to
say, the total number of observations which it represents, is given by

$$\text{Area} = \sum_{r=1}^{r=n}(f_r) \times h$$

where h is the width of the interval, $f_r$ is the frequency in the rth interval
and there are n intervals.

As the histogram tends towards the continuous curve the width of the
intervals becomes smaller and the number of terms in the summation
becomes larger. For the normal curve, which extends to infinity on
either side of the mean, the limit to which the sum tends as the intervals
become indefinitely small and the number of terms indefinitely large is
written

$$\int_{-\infty}^{\infty} y_0\,e^{-\frac{x^2}{2\sigma^2}}\,dx$$

the sign $\int$ being a conventional form of the summation sign S and dx
representing the infinitesimally small value of h.

This is the notation of the integral calculus, and the quantity
$\int_{-a}^{+b} F(x)\,dx$ is said to be the integral of F(x) with respect to x
between the limits -a and +b. In this book we shall not use the methods of
the integral calculus, and accordingly it will be necessary for us to state
certain results without proof. It will be sufficient if the student bears in
mind that the process of integration is one of proceeding to the limit in
cases of straightforward summation with which he is already familiar.


8.22 The area of the curve

$$y = y_0\,e^{-\frac{x^2}{2\sigma^2}}$$

is then

$$\int_{-\infty}^{\infty} y_0\,e^{-\frac{x^2}{2\sigma^2}}\,dx$$

and this is equal to

$$y_0\,\sigma \times \sqrt{2\pi} = 2.506628\,y_0\,\sigma$$

Hence the curve

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}$$

has unit area, and for this reason the equation of the normal curve is usually
written in the standard form

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}} \qquad (8.15)$$

From this the form corresponding to a distribution of any given frequency
is immediately written down. In fact, if the frequency is N, the
corresponding normal curve is

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}} \qquad (8.16)$$
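The limiting process of 8.21 and the area stated in 8.22 can be illustrated by summing narrow strips, just as for a histogram. The sketch below is only a numerical check, not a proof; the strip width, the range of ten standard deviations on each side and the function names are arbitrary choices.

```python
from math import exp, pi, sqrt


def area_by_strips(y0, sigma, h=0.01, half_range=10.0):
    """Sum of (ordinate at mid-strip) * h over -half_range*sigma .. +half_range*sigma."""
    left = -half_range * sigma
    steps = int(round(2 * half_range * sigma / h))
    total = 0.0
    for i in range(steps):
        x = left + (i + 0.5) * h
        total += y0 * exp(-x * x / (2 * sigma ** 2))
    return total * h


def y0_for_frequency(N, sigma):
    """Central ordinate that makes the area equal to N, as in (8.16)."""
    return N / (sigma * sqrt(2 * pi))


if __name__ == "__main__":
    print(area_by_strips(1.0, 4.0), 4.0 * sqrt(2 * pi))
    print(area_by_strips(y0_for_frequency(10_000, 4.0), 4.0))
```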


Constants of the normal curve 

8.23 The mean of the curve is, as we have seen, located at the origin.
If we wish to write the curve with reference to some other point as origin,
we can do so in the form

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-m)^2}{2\sigma^2}} \qquad (8.17)$$

where m is the excess of the mean over the value chosen as origin.

The standard deviation of the curve is $\sigma$, and the variance is accordingly
$\sigma^2$.

The higher moments are calculated by the processes of the integral
calculus. Since the nth moment about the mean is given by

$$\mu_n = \frac{1}{N}\,\Sigma(fx^n)$$

we have, proceeding to the limit, that the nth moment of the normal curve
is

$$\mu_n = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n\,e^{-\frac{x^2}{2\sigma^2}}\,dx$$

If n is odd this vanishes, as it must for any symmetrical curve. If n is even
we have

$$\mu_n = \frac{n!}{2^{\frac{1}{2}n}\,(\frac{1}{2}n)!}\,\sigma^n \qquad (8.18)$$

and hence,

$$\mu_4 = \frac{4\cdot 3\cdot 2}{2\cdot 2\cdot 2}\,\sigma^4 = 3\sigma^4 \qquad (8.19)$$
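Formula (8.18) can be checked against a direct summation of $x^n$ times the ordinate over narrow strips. The sketch below works in units of the standard deviation and then rescales by $\sigma^n$; the function names, strip width and range are assumptions of the illustration.

```python
from math import exp, factorial, pi, sqrt


def moment_formula(n, sigma):
    """Equation (8.18) for even n; zero for odd n by symmetry."""
    if n % 2:
        return 0.0
    half = n // 2
    return factorial(n) / (2 ** half * factorial(half)) * sigma ** n


def moment_by_strips(n, sigma, h=0.001, half_range=12.0):
    """nth moment of the unit-area normal curve by summing narrow strips."""
    steps = int(round(2 * half_range / h))
    total = 0.0
    for i in range(steps):
        t = -half_range + (i + 0.5) * h  # x measured in units of sigma
        total += t ** n * exp(-t * t / 2)
    return sigma ** n * total * h / sqrt(2 * pi)


if __name__ == "__main__":
    for n in (2, 3, 4, 6):
        print(n, moment_formula(n, 2.0), round(moment_by_strips(n, 2.0), 6))
```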


8.24 From these results it follows that

$$\beta_1 = \gamma_1 = 0, \qquad \gamma_2 = 0 \qquad (8.20)$$

i.e. the normal curve has zero kurtosis. This is, in fact, the origin of the
choice of the apparently arbitrary value 3 in the definitions of platy- and
lepto-kurtosis (7.14).

We may also state without proof the important result that all cumulants
of the normal curve of orders higher than the second vanish identically.

8.25 The mean deviation of the normal curve is

$$\sigma\sqrt{\frac{2}{\pi}} = 0.79788\ldots\,\sigma$$

This is the origin of the rule given in 6.22, that the mean deviation is
approximately 4/5 of the standard deviation. The result is true of the
normal curve, and very approximately true of curves which do not differ
markedly from the normal form. The rules that a range of 6 times the
standard deviation includes the great majority of the observations (6.13)
and that the quartile deviation is about 2/3 of the standard deviation (6.25)
were also suggested by the properties of the normal curve (see below,
8.28 and 8.29).


Ordinates of the normal curve 

8.26 The normal curve is so important that tables have been prepared
to give (1) the ordinate of the curve corresponding to any given value
of x, i.e. the values of $\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}$, and (2) the areas of
the curve to the right and the left of any given ordinate, i.e. the values of
$\frac{1}{\sqrt{2\pi}}\int_{x}^{\infty} e^{-\frac{x^2}{2}}\,dx$ and
$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{x^2}{2}}\,dx$.
Table 1 of the Appendix gives the values of the ordinate for values of x
proceeding by steps of one-tenth of the standard deviation. The values are,
of course, the same for positive as for negative values of x. More extended
tables will be found in Tables for Statisticians and Biometricians, Part I.

The ordinate of any normal curve corresponding to a specified value of
the variate is easily obtained from the table, as may be seen from the
following example.

Example 8.3. To find the ordinate of the normal curve given by

$$y = \frac{10{,}000}{4\sqrt{2\pi}}\,e^{-\frac{x^2}{32}}$$

corresponding to the variate value x=7.

Here

$$N = 10{,}000, \qquad \sigma = 4$$

Altering the value of $\sigma$ is equivalent to altering the scale of x. The
ordinate in this curve corresponding to x=7 is therefore found from the
ordinate of the curve of unit s.d. and unit area corresponding to
$x = \frac{7}{4} = 1.75$.

From Appendix Table 1, when

x=1.8    y=0.07895
x=1.7    y=0.09405

Hence, by simple interpolation, when

x=1.75   y=0.08650

The ordinate is 10,000/4 times this, i.e. is equal to 216. This is accurate
to the nearest unit.
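Example 8.3 can be repeated both ways: by interpolating between the two tabular ordinates quoted in the text, and by evaluating the ordinate formula directly. The helper names below are assumptions of this sketch.

```python
from math import exp, pi, sqrt


def unit_ordinate(t):
    """Ordinate of the normal curve of unit area and unit standard deviation."""
    return exp(-t * t / 2) / sqrt(2 * pi)


def interpolate(t, t0, y0, t1, y1):
    """Simple (linear) interpolation between two tabular entries."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


N, sigma, x = 10_000, 4, 7
t = x / sigma  # 1.75

# Tabular ordinates at 1.7 and 1.8 as quoted in the example.
tabular = interpolate(t, 1.7, 0.09405, 1.8, 0.07895)
ordinate_from_table = N / sigma * tabular
ordinate_direct = N / sigma * unit_ordinate(t)

print(round(tabular, 5), round(ordinate_from_table), round(ordinate_direct))
```

Both routes should round to the same whole number, which is the sense in which the text calls the interpolated value accurate to the nearest unit.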


Area of the normal curve — the probability integral 
8.27 A table of the areas of the normal curve cut off by ordinates at
specified values of x is given in Table 2 of the Appendix. As in the case
of the table of ordinates, this table is applicable to all normal curves,
whatever the value of their standard deviation, the areas cut off on
$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}$ by ordinates at x being the
same as those cut off on $y = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}$ by ordinates
at $\frac{x}{\sigma}$. More extended tables will again be found in Tables for
Statisticians and Biometricians, Part I.

The area of the normal curve to the left of the ordinate at x or, it may
be, between the ordinates at 0 and x (conventions differ) is sometimes
termed the probability integral or the error function. These names arise
from the use of the function in the theory of sampling and the theory
of errors respectively.

Example 8.4. Find the frequency represented by the smaller area of
the curve $y = \frac{10{,}000}{4\sqrt{2\pi}}\,e^{-\frac{x^2}{32}}$ cut off by the
ordinate at x=7.

Here

$$\sigma = 4, \qquad \frac{x}{\sigma} = 1.75$$

For $\frac{x}{\sigma} = 1.75 = 1.5 + 0.25$ the table gives the value 0.9599. Hence
the smaller fraction equals 1 - 0.9599 = 0.0401 and multiplying this by
10,000, we have the frequency represented, i.e. 401.

Example 8.5. A hundred coins are thrown a number of times. How
often approximately in 10,000 throws may (1) exactly 65 heads, (2) 65
heads or more, be expected?

The number of heads is given by the terms in

$$10{,}000\left(\tfrac{1}{2}+\tfrac{1}{2}\right)^{100}$$

The standard deviation is $\sqrt{0.5 \times 0.5 \times 100} = 5$,
$\frac{N}{\sigma} = 2{,}000$, and the exponent is large enough for us to be able
to take the distribution as normal.

The mean number of heads is 50, and $65 - 50 = 3\sigma$. The frequency of a
deviation of $3\sigma$ is given at once by Appendix Table 1 as
2,000 x 0.00443 = 8.86, or nearly 9 throws in 10,000. A throw of 65 heads
will therefore be expected about 9 times.

The frequency of throws of 65 heads or more is given by Appendix
Table 2, but a little caution must now be used, owing to the discontinuity
of the distribution. A throw of 65 heads is equivalent to a range of
64.5-65.5 on the continuous scale of the normal curve, the division between
64 and 65 coming at 64.5. $64.5 - 50 = +2.9\sigma$, and a deviation of
$+2.9\sigma$ or more will only occur, as given by the table, 187 times in
100,000 throws, or, say, 19 times in 10,000.
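The two normal approximations of Example 8.5 can be set beside the exact binomial values. In the sketch below the upper-tail area is obtained from `math.erf` rather than from a printed table, which is an implementation choice of this illustration; the variable names are likewise hypothetical.

```python
from math import comb, erf, exp, pi, sqrt


def unit_ordinate(t):
    """Ordinate of the normal curve of unit area and unit standard deviation."""
    return exp(-t * t / 2) / sqrt(2 * pi)


def upper_tail(t):
    """Area of the unit normal curve to the right of the ordinate at t."""
    return 0.5 * (1 - erf(t / sqrt(2)))


N, n, p = 10_000, 100, 0.5
sigma = sqrt(n * p * (1 - p))  # 5
mean = n * p                   # 50

# (1) exactly 65 heads: ordinate at 3 sigma, scaled by N / sigma.
normal_exactly_65 = N / sigma * unit_ordinate((65 - mean) / sigma)
exact_exactly_65 = N * comb(n, 65) / 2 ** n

# (2) 65 heads or more: area beyond 64.5, i.e. beyond 2.9 sigma.
normal_65_or_more = N * upper_tail((64.5 - mean) / sigma)
exact_65_or_more = N * sum(comb(n, s) for s in range(65, n + 1)) / 2 ** n

print(round(normal_exactly_65, 2), round(exact_exactly_65, 2))
print(round(normal_65_or_more, 2), round(exact_65_or_more, 2))
```

Starting the tail at 64.5 rather than 65 is the allowance for discontinuity that the example describes.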

8.28 From the table of areas we can find approximately the position
of the quartiles. In fact, we require the value of $\frac{x}{\sigma}$ which will give
us 0.75 as the greater fraction of the area. From the table we see that this
value must lie between 0.67 and 0.68. Simple interpolation between these
two tabular values gives 0.675, approximately; a more exact result is

Quartile deviation = $0.67448975\,\sigma$    (8.21)

This is the origin of the rough rule that the semi-interquartile range is
usually about 2/3 of the standard deviation.
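The value in (8.21) can be recovered by repeatedly halving the interval 0.67 to 0.68 until the area to the left of the ordinate equals 0.75. The sketch below computes the area from `math.erf` instead of a printed table; that substitution, the tolerance and the function names are choices of this illustration.

```python
from math import erf, sqrt


def area_left(t):
    """Area of the unit normal curve to the left of the ordinate at t."""
    return 0.5 * (1 + erf(t / sqrt(2)))


def quartile_deviation(lo=0.67, hi=0.68, target=0.75, tol=1e-12):
    """Bisection for the value of x/sigma giving `target` as the greater fraction."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if area_left(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(quartile_deviation(), 8))
```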

8.29 We also observe from the table that an ordinate $3\sigma$ from the mean
cuts off an area 0.99865 of the whole. The smaller fraction left is therefore
0.00135 of the whole. Since the curve is symmetrical, it follows that
a range of $3\sigma$ on each side of the mean will cut off all but twice this, i.e.
all but 0.00270 of the whole. This again is the origin of the rule that
such a range includes the great majority of the observations.


The normal distribution as an error distribution

8.30 We have deduced the normal distribution as a limiting form of
the binomial distribution when n, the exponent, is large. This, however,
is only one of the ways in which the normal curve occurs in statistical
literature, and Gauss was led to it by a totally different line of reasoning,
viz. by inquiring what law of distribution errors of observation should
obey in order to make the arithmetic mean of a set of measurements the
most likely value of the "true" magnitude.

8.31 Suppose we take a population of measurements of some magnitude,
and consider the population of deviations from the true value. Let us
further suppose that any deviation is the result of the operation of an
indefinitely large number of small causes, each producing a small
perturbation. Let us assume that the small perturbations are all equal, and that
positive and negative perturbations are equally likely.

Then it may be shown that the distribution of errors x about the true
value (taken as zero) is given by the law

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}}$$

For, if $\delta$ is the amount of the perturbation, and positive and negative
perturbations are equally likely, the expected frequency of m positive
errors and $n-m$ negative errors in N observations is the term in
$(\tfrac12)^{m}(\tfrac12)^{n-m}$ of $N(\tfrac12+\tfrac12)^{n}$, and the actual error is
$m\delta-(n-m)\delta=(2m-n)\delta$. Similarly, the frequency of the actual error
$\{2(m+1)-n\}\delta$ is given by the term in $(\tfrac12)^{m+1}(\tfrac12)^{n-m-1}$; and so on.
Proceeding to the limit, as n becomes large,
we get the stated result precisely as for the limiting process of 8.15.
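
The limiting argument can be illustrated numerically. The sketch below assumes illustrative values n = 400 and $\delta = 0.05$ (neither is from the text) and uses the fact that the sum of n equal perturbations of either sign has standard deviation $\delta\sqrt{n}$. It compares the binomial frequency of each possible error with the ordinate of the normal curve multiplied by the spacing $2\delta$ between successive errors.

```python
import math

def error_frequency(n, m):
    """Relative frequency of m positive and n - m negative perturbations."""
    return math.comb(n, m) / 2.0**n

def normal_ordinate(x, sigma):
    """Ordinate of the normal curve of unit area at error x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

n, delta = 400, 0.05                 # assumed illustrative values
sigma = delta * math.sqrt(n)         # standard deviation of the total error
step = 2.0 * delta                   # spacing between successive possible errors

for m in (200, 210, 220, 230):
    x = (2 * m - n) * delta          # actual error (2m - n) * delta
    exact = error_frequency(n, m)
    approx = normal_ordinate(x, sigma) * step
    print(f"x = {x:5.2f}  binomial = {exact:.6f}  normal = {approx:.6f}")
```

The agreement is closest near the centre and loosens in the tails, which is the usual behaviour of the normal limit for finite n.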


8.32 In the theory of errors it is more customary to write

$$h = \frac{1}{\sigma\sqrt{2}}$$

so that the distribution becomes

$$y = \frac{Nh}{\sqrt{\pi}}\, e^{-h^{2}x^{2}} \qquad (8.22)$$

h is called the "precision" (cf. 6.17). As h increases, the normal curve
becomes narrower and hence h measures in a sense the closeness of the
bulk of observations to the true value.
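
The passage from the ordinary form of the normal curve to the form in h is a direct substitution. As a sketch, assuming the normal law is written $y = \frac{N}{\sigma\sqrt{2\pi}}e^{-x^{2}/2\sigma^{2}}$ and the precision is defined by $h = 1/(\sigma\sqrt{2})$ (the forms given in the summary of this chapter), the steps are:

```latex
\begin{aligned}
h = \frac{1}{\sigma\sqrt{2}} \quad &\Longrightarrow \quad \sigma = \frac{1}{h\sqrt{2}}, \\
\frac{1}{\sigma\sqrt{2\pi}} &= \frac{h\sqrt{2}}{\sqrt{2\pi}} = \frac{h}{\sqrt{\pi}},
\qquad
\frac{x^{2}}{2\sigma^{2}} = h^{2}x^{2}, \\
y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-x^{2}/2\sigma^{2}} &= \frac{Nh}{\sqrt{\pi}}\,e^{-h^{2}x^{2}} .
\end{aligned}
```

The central ordinate $Nh/\sqrt{\pi}$ is proportional to h, so doubling the precision doubles the height of the curve at the true value and halves its spread.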

The occurrence of normal distributions in nature 

8.33 It was found at an early date that error distributions followed
the normal law more or less closely, though it must be admitted not with
any great exactitude. The fact that many populations, particularly
biometrical populations such as those classified according to height and weight,
lie distributed round the mean in a humped curve which is not unlike the
normal curve, gave rise in the first half of the nineteenth century to keen
interest. Although the term "normal" had not then been applied, there
appears to have been a feeling that the curve was the ideal to which most
distributions should in some degree attain, and that an explanation was
demanded if they did not. The normal curve was, in fact, to the early
statisticians what the circle was to the Ptolemaic astronomers.

8.34 Workers during the latter half of the nineteenth century were
more careful not to let their theories outrun their facts, and as the data
accumulated it became evident that the normal distribution was no more
usual than any other type. In fact, rather the reverse, so that the
occurrence of a normal distribution was to be regarded as something abnormal.
"The reader may well ask," said Karl Pearson, "is it not possible to find
material which obeys within probable limits the normal law? I reply,
yes, but this law is not a universal law of nature. We must hunt for
cases."

The belief in the validity of the normal law in the theory of errors died
harder. "As M. Lippmann once said to me," says Poincaré in his Calcul
des Probabilités, "everybody believes in the law of errors, the
experimenters because they think it is a mathematical theorem, the
mathematicians because they think it is an experimental fact."

8.35 One must, however, be careful not to go too far in seeking to avoid
an over-emphasis on the practical occurrence of the normal curve. A
certain number of distributions, more particularly those relating to 
measurements on plants and animals, are approximately of the normal 
form. As an example, we may take the distribution of Table 4.7, which 
we show in fig. 8.3 fitted with a normal curve. 

Place of the normal curve in theory

8.36 Strangely enough, the realisation that the normal distribution 
did not correspond to any widespread natural effect did not diminish its 
importance in statistical theory. On the contrary, the normal distribution 
has increased in importance in recent years. It is instructive to consider 
why this is so. 

In the first place, the normal curve and the normal integral have 
numerous mathematical properties which make them attractive and com- 
paratively easy to manipulate. We have, for instance, already seen that 
the moments and cumulants of the normal curve are expressible in simple 
forms. 

Now the normal form is reasonably close to many distributions of the
humped type. If, therefore, we are ignorant of the exact nature of a 
humped distribution, or know the form but find it mathematically intract- 
able, we may assume as a first approximation that the distribution is normal 
and see where this assumption leads us. It is not infrequently found that 
a population represented in this way is sufficiently accurately specified for 
the purposes of the inquiry. 

8.37 Secondly, we shall find, when we come to consider sampling
distributions, that many of the populations which occur are of the normal
form, either exactly or to a satisfactory degree of approximation.

8.38 Thirdly, the theory of the normal curve has been applied to the
graduation of curves which are not normal.



Fig. 8.3. The distribution of stature for adult males in the British Isles (fig. 4.6, page 83),
fitted with a normal curve

To avoid confusing the figure, the frequency-polygon has not been drawn in, the tops 
of the ordinates being shown by small circles. 


It is possible to develop a technique for expressing a given distribution
in the form of an infinite series whose terms depend on the quantity
$e^{-\frac{1}{2}x^{2}}$ and certain dependent functions.

8.39 Fourthly, distributions which are not normal can sometimes be
brought to a form approximating to the normal by a transformation of
the variate. A population which is skew with respect to a variate x, for
instance, might be normal when we take $\sqrt{x}$ as the variate. We gave an
example of this kind of effect in Exercise 4.6, page 100, where we saw that a
population of men classified according to their weight was skew, whereas a
population classified according to height (which we may take to be roughly
proportional to the cube root of the weight) is nearly normal.

The Poisson distribution

8.40 We have found that the limit to the binomial would be a normal
curve even if p and q were unequal, provided that n were increased sufficiently
to make $(q-p)$ small compared with $\sqrt{npq}$. We now propose to find
the limit to the same series if one of the chances, say q, becomes indefinitely
small and n is increased sufficiently to keep nq finite, but not necessarily
large; practical values are in fact usually small.

Let us suppose that q is very small and that qn is equal to the finite
number m.

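In the binomial $(p+q)^{n}$, the term

$$\frac{n!}{r!\,(n-r)!}\,p^{n-r}q^{r}
= \frac{n!}{r!\,(n-r)!}\left(1-\frac{m}{n}\right)^{n-r}\left(\frac{m}{n}\right)^{r}
= \frac{m^{r}}{r!}\left(1-\frac{m}{n}\right)^{n}\cdot
\frac{n!}{(n-r)!\;n^{r}\left(1-\dfrac{m}{n}\right)^{r}} \qquad (8.23)$$

Now the limit of $\left(1-\dfrac{m}{n}\right)^{n}$ as n becomes large $= e^{-m}$.

Applying Stirling's approximation (8.16) when n is large, the term

$$\frac{n!}{(n-r)!\;n^{r}\left(1-\dfrac{m}{n}\right)^{r}}
= \frac{\sqrt{2\pi n}\;n^{n}e^{-n}}{\sqrt{2\pi(n-r)}\;(n-r)^{n-r}e^{-(n-r)}\;n^{r}\left(1-\dfrac{m}{n}\right)^{r}}
= \frac{1}{e^{r}\left(1-\dfrac{r}{n}\right)^{n-r+\frac12}\left(1-\dfrac{m}{n}\right)^{r}} \qquad (8.24)$$

Now the limit of $\left(1-\dfrac{r}{n}\right)^{n}$ is $e^{-r}$, since we need not consider terms in which
r exceeds quantities of the order $\sqrt{n}$, and the limits of

$$\left(1-\frac{r}{n}\right)^{-r+\frac12} \quad\text{and}\quad \left(1-\frac{m}{n}\right)^{r}$$

are both unity. Hence the limit of (8.24) is unity, and the limit of (8.23) is

$$e^{-m}\,\frac{m^{r}}{r!}$$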

8.41 Hence the successive terms in the binomial are

$$e^{-m},\quad e^{-m}\,m,\quad e^{-m}\,\frac{m^{2}}{2!},\quad e^{-m}\,\frac{m^{3}}{3!},\quad \text{etc.}$$

and the limit of $(p+q)^{n}$ is

$$e^{-m}\left(1+m+\frac{m^{2}}{2!}+\frac{m^{3}}{3!}+\dots\right) \qquad (8.25)$$

This expression is called Poisson's distribution, or Poisson's exponential
limit. It was first published by Poisson in 1837, but has subsequently
been rediscovered by numerous writers.
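
The approach of the binomial terms to Poisson's limit can be watched directly. In the sketch below the value m = 2 and the helper names are assumptions made for illustration; q is set to m/n so that nq stays fixed while n grows.

```python
import math

def binomial_term(n, r, q):
    """Chance of exactly r occurrences in n trials, each with chance q."""
    return math.comb(n, r) * q**r * (1.0 - q)**(n - r)

def poisson_term(m, r):
    """Term e^(-m) m^r / r! of Poisson's exponential limit."""
    return math.exp(-m) * m**r / math.factorial(r)

def largest_difference(n, m, r_max=8):
    """Largest gap between binomial and Poisson terms for r = 0..r_max."""
    q = m / n                      # q shrinks as n grows, keeping nq = m
    return max(abs(binomial_term(n, r, q) - poisson_term(m, r))
               for r in range(r_max + 1))

m = 2.0                            # assumed illustrative value of nq
for n in (10, 100, 10000):
    print(f"n = {n:6d}  largest difference = {largest_difference(n, m):.6f}")
```

The largest difference shrinks as n increases, in line with the limiting argument of 8.40.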

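Constants of the Poisson distribution

8.42 Taking an origin located at the first term of the distribution, we
have

$$\mu_1' = e^{-m}\left(0+m+2\cdot\frac{m^{2}}{2!}+3\cdot\frac{m^{3}}{3!}+\dots\right)
= m\,e^{-m}\left(1+m+\frac{m^{2}}{2!}+\dots\right) = m$$

$$\begin{aligned}
\mu_2' &= e^{-m}\left[0+m+\left(\frac{m^{2}}{2!}\times 2^{2}\right)+\left(\frac{m^{3}}{3!}\times 3^{2}\right)+\dots\right] \\
&= m\,e^{-m}\left(1+m(1+1)+\frac{m^{2}}{2!}(2+1)+\dots\right) \\
&= m\,e^{-m}\left(1+m+\frac{m^{2}}{2!}+\dots+m+m^{2}+\frac{m^{3}}{2!}+\dots\right) \\
&= m\,e^{-m}\left(e^{m}+m\,e^{m}\right) \\
&= m(m+1)
\end{aligned}$$

It may also be shown that

$$\mu_3' = m(m^{2}+3m+1) = m\{(m+1)^{2}+m\}$$

$$\mu_4' = m(m^{3}+6m^{2}+7m+1)$$

From these results we have immediately

$$\text{Mean} = m \qquad (8.26)$$

$$\mu_2 = m(m+1)-m^{2} = m$$

$$\sigma = \sqrt{m} \qquad (8.27)$$

Hence,

$$\sigma^{2} = m = \text{mean}$$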

8.43 The third and fourth moments about the mean will be found to be

$$\mu_3 = m \qquad (8.28)$$

$$\mu_4 = 3m^{2}+m \qquad (8.29)$$

so that

$$\beta_1 = \frac{\mu_3^{2}}{\mu_2^{3}} = \frac{m^{2}}{m^{3}} = \frac{1}{m} \qquad (8.30)$$

$$\beta_2 = \frac{\mu_4}{\mu_2^{2}} = \frac{3m^{2}+m}{m^{2}} = 3+\frac{1}{m} \qquad (8.31)$$

These results should be compared with the expressions

$$\beta_1 = \frac{(q-p)^{2}}{npq}, \qquad \beta_2 = 3+\frac{1-6pq}{npq}$$

for the binomial. They are, as might be expected, the limits of those
expressions when $q = m/n$ and n is large.

8.44 We may state without proof that all the cumulants of the Poisson
distribution are equal to m.

8.45 Tables of the limit $e^{-m}\,m^{r}/r!$ for various values of m and r have
been published by several authorities. One such set will be found in
Tables for Statisticians and Biometricians, Part I.
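
Where tables are not to hand, the terms and the constants of 8.42 and 8.43 can be obtained by direct summation. The sketch below builds the successive terms by the recurrence "next term = previous term × m/(r+1)" and then sums for the moments; the function name, the cut-off of 200 terms and the trial value m = 0.61 are assumptions made for illustration.

```python
import math

def poisson_moments(m, terms=200):
    """Mean and central moments mu2, mu3, mu4 of a Poisson distribution,
    obtained by direct summation of the first `terms` terms."""
    probs = []
    p = math.exp(-m)              # term for r = 0
    for r in range(terms):
        probs.append(p)
        p *= m / (r + 1)          # next term: multiply by m / (r + 1)
    mean = sum(r * pr for r, pr in enumerate(probs))
    mu2, mu3, mu4 = (sum((r - mean) ** k * pr for r, pr in enumerate(probs))
                     for k in (2, 3, 4))
    return mean, mu2, mu3, mu4

m = 0.61                          # assumed illustrative value
mean, mu2, mu3, mu4 = poisson_moments(m)
beta1 = mu3**2 / mu2**3
beta2 = mu4 / mu2**2
print(f"mean = {mean:.4f}, mu2 = {mu2:.4f}, mu3 = {mu3:.4f}, mu4 = {mu4:.4f}")
print(f"beta1 = {beta1:.4f} (1/m = {1/m:.4f}), beta2 = {beta2:.4f} (3 + 1/m = {3 + 1/m:.4f})")
```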

The form of the frequency-polygon of the distribution (which, like the 
binomial and unlike the normal, is discontinuous) can be judged from 
fig. 8.4, in which the polygons for various values of m are drawn. It will
be seen that for low values of m the polygon is very skew, but that for 
larger values it tends towards a symmetrical form. 

8.46 The condition that p or q shall be small, np or nq remaining finite,
implies that in practice we should expect to find a Poisson distribution
in cases where the chance of any individual being a "success" was small.
Such a case might arise, for example, in considering the deaths from
a rare disease in a population, the chance of any individual dying from
it being small.

8.47 Attention to the fact that comparatively rare events are not
haphazard was first directed by Quetelet and von Bortkiewicz. The
latter's data of the number of men killed by the kick of a horse in certain
Prussian army corps in twenty years (1875-94) have become classical.

The frequency-distribution of the number of deaths in 10 corps per 
army corps per annum over twenty years was — 

Deaths    Frequency
  0          109
  1           65
  2           22
  3            3
  4            1

Here the total number of deaths was 122, and hence the mean deaths per
army corps per annum is 0.61. Taking this as m, we find the following
values for various numbers of deaths per annum:

Deaths    Frequency assigned by Poisson's limit
  0          108.7
  1           66.3
  2           20.2
  3            4.1
  4            0.7 (4 and over)

If we calculate $\sigma^{2}$ for the actual distribution, we find
$\sigma = 0.78$, $\sigma^{2} = 0.6079$




Hence, $\sigma^{2}$ is nearly equal to the mean, which is in accordance with theory.
The agreement is, in fact, very much closer than is usual. Many
distributions are now available for the frequency of individuals who have met
with 0, 1, 2, . . . accidents, e.g. in factories, during a given period of time,
and more often than not such distributions give a value of the variance
exceeding the mean. This state of affairs can be accounted for on the
assumption that the individuals at risk have varying degrees of
"accident-proneness," and the assumption has been corroborated by finding that
those individuals who have the largest number of accidents in one period
are, on the whole, those who have most accidents during a succeeding period.

A more modern example of the occurrence of the distribution is given
in the following data relating to the incidence of flying bombs (V1) in an
area in south London. An area of 144 square kilometers was selected
for which the mean density of bombs appeared constant. To test the
hypothesis that the bombs fell in clusters the area was divided into 576
squares of 1/4 square kilometer each and a count made of the numbers of squares
containing 0, 1, 2, etc. bombs, of which there were 537 altogether. A
comparison with the frequencies given by a Poisson distribution is as
follows (data from R. D. Clarke, 1948, Jour. Inst. Act., 72, No. 335):


Number of flying      Actual number     Theoretical number given by
bombs per square      of squares        the Poisson distribution

0                        229                226.74
1                        211                211.39
2                         93                 98.54
3                         35                 30.62
4                          7                  7.14
5 and over                 1                  1.57

Total                    576                576.00


The agreement is extraordinarily close and there appears no evidence 
that the bombs “ clustered " otherwise than by chance. 
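
Both comparisons in this section (the horse-kick deaths and the flying bombs) follow the same recipe: estimate m as events per unit, compute $N e^{-m} m^{r}/r!$ for each class, and put whatever frequency remains into the open final class. A minimal sketch, with function names chosen for illustration only:

```python
import math

def poisson_expected(observed, total_events=None):
    """Expected class frequencies under Poisson's limit.

    observed[r] is the number of units showing r events; the last class
    is treated as "r and over". If total_events is not given it is taken
    as the sum of r * observed[r].
    """
    n_units = sum(observed)
    if total_events is None:
        total_events = sum(r * f for r, f in enumerate(observed))
    m = total_events / n_units
    expected = [n_units * math.exp(-m) * m**r / math.factorial(r)
                for r in range(len(observed) - 1)]
    expected.append(n_units - sum(expected))   # remaining frequency, open class
    return m, expected

def variance(observed):
    """Variance of the observed distribution about its mean."""
    n_units = sum(observed)
    mean = sum(r * f for r, f in enumerate(observed)) / n_units
    return sum(r * r * f for r, f in enumerate(observed)) / n_units - mean**2

horse_kicks = [109, 65, 22, 3, 1]          # deaths 0..4 per corps per annum
flying_bombs = [229, 211, 93, 35, 7, 1]    # bombs 0..4 and "5 and over" per square

m_hk, exp_hk = poisson_expected(horse_kicks)
print(f"horse kicks: m = {m_hk:.2f}, variance = {variance(horse_kicks):.4f}")
print([round(e, 1) for e in exp_hk])

# The total of 537 bombs is supplied because the open class hides the exact count.
m_fb, exp_fb = poisson_expected(flying_bombs, total_events=537)
print(f"flying bombs: m = {m_fb:.4f}")
print([round(e, 2) for e in exp_fb])
```

Comparing the variance with m, as is done for the horse-kick data, is a quick first check on whether a Poisson form is plausible before the class frequencies are compared in detail.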

It is an interesting reflection that although the cavalry of 1875 developed
into the flying bomb of 1945 the laws of probability seem to have endured
over this span of 70 years.

Another example of the Poisson distribution is given in Exercise 8.17
at the end of this chapter. The early instances of the distribution were
nearly all demographic, and for some time it remained more of a curiosity
than a useful tool. In 1907, however, "Student" drew attention to a
class of haemacytometer counts to which the distribution seemed
appropriate, and since that time it has found several important biological
applications. It also appears in problems of controlling road and telephone traffic.

Pearson curves

8.48 The process of obtaining the normal curve as a limit of the binomial
suggested to Karl Pearson an investigation into a series of analogous
curves which may be regarded as limits to skew binomials or to distributions
from a finite population, e.g. by drawing r balls at a time from a bag which
contains a finite number N of black and white balls in given proportions.
One such curve was of the form

$$y = y_0\left(1+\frac{x}{a}\right)^{\gamma a} e^{-\gamma x}$$

This set of curves, divided into twelve types, which were later regarded
from rather a different standpoint, can be made to fit a large number of the
distributions occurring in practice.

In the curve given above, $\gamma$, a and the origin can all be obtained from
the first three moments. For the other curves of Pearson's system,
except some degenerate types, the first four moments are necessary to
specify the constants of the curve completely. The distributions
considered hitherto have required in addition to the area (number of
observations), either the mean only (Poisson) or the mean and standard deviation
(normal curve) to determine their constants; but the principle of fitting
for the more general curves remains the same. The actual moments of
the curves are equated to the moments expressed in terms of the constants,
such as $\gamma$ and a, which are to be found. For full details of these curves,
the method of determining the type to choose and the method of fitting,
the student is referred to Elderton's Frequency Curves and Correlation
and Kendall's Advanced Theory of Statistics, vol. 1.


SUMMARY 


1. If the chance of the success of an event is p, and of its failure q, then,
provided that the chance remains constant throughout the trials, the
expected frequencies of 0, 1, 2, . . . successes in N sets of n trials are the
1st, 2nd, etc. terms in the binomial $N(q+p)^{n}$.

2. The mean of the binomial is pn and its standard deviation is $\sqrt{npq}$.

3. For the binomial

$$\beta_1 = \frac{(q-p)^{2}}{npq}, \qquad \beta_2 = 3+\frac{1-6pq}{npq}$$

4. If neither p nor q is small, the binomial tends for large values of n
to the form

$$y = y_0\,e^{-x^{2}/2npq}$$

5. This curve, which may also be written

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-x^{2}/2\sigma^{2}}$$

is called the normal curve.

6. The standard deviation of the normal curve is $\sigma$. Its third moment
is zero, and the fourth moment is $3\sigma^{4}$. Hence

$$\beta_1 = 0, \qquad \beta_2 = 3$$

All cumulants higher than the second are zero.

7. In the theory of errors the normal population is usually written

$$y = \frac{Nh}{\sqrt{\pi}}\,e^{-h^{2}x^{2}}$$

h being called the precision.

8. The mean deviation of the normal curve is

$$\sigma\sqrt{\frac{2}{\pi}} = 0.79788\,\sigma$$

and the quartile deviation (semi-interquartile range) is $0.67448975\ldots\,\sigma$

9. A range $3\sigma$ on each side of the mean of the normal curve contains
0.9973 of the distribution.

10. If p or q is small and one of pn, qn is finite and equal to m, the
binomial distribution tends to the limit

$$e^{-m}\left(1+m+\frac{m^{2}}{2!}+\frac{m^{3}}{3!}+\dots\right)$$

This is called the Poisson distribution.

11. The mean of the Poisson distribution is m, and $\sigma^{2}$ also equals m.

12. For the Poisson distribution

$$\beta_1 = \frac{1}{m}, \qquad \beta_2 = 3+\frac{1}{m}$$

and all the cumulants are equal to m.


EXERCISES 

8.1 A perfect cubic die is thrown a large number of times in sets of 8.
The occurrence of a 5 or a 6 is called a success. In what proportion of the
sets would you expect 8 successes ?

8.2 The following data, due to W. F. R. Weldon, show the results of
throwing 12 dice 4,096 times, a throw of 4, 5 or 6 being called a success:

Successes   Frequency      Successes   Frequency
    0           -              7          847
    1           7              8          536
    2          60              9          257
    3         198             10           71
    4         430             11           11
    5         731             12            -
    6         948            Total      4,096

Find the expected frequencies, and compare the actual mean and standard
deviation with those of the expected distribution.

8.3 In the previous example find the equation of the normal curve which
has the same mean, standard deviation and total frequency as the observed
distribution.

Find the frequencies to be expected if the distribution were represented 
exactly by the ordinates of this curve and compare them with the actual 
frequencies. 

8.4 Assuming that half the population are consumers of chocolate, so that 
the chance of an individual being a consumer is 1/2, and assuming that 100
investigators each take ten individuals to see whether they are consumers, 
how many investigators would you expect to report that three people 
or less were consumers ? 

8.5 An irregular six-faced die is thrown, and the expectation that in 10 
throws it will give five even numbers is twice the expectation that it will 
give four even numbers. How many times in 10,000 sets of 10 throws 
would you expect it to give no even numbers ? 

8.6 If two normal populations have the same total frequency but the $\sigma$
of one is k times that of the other, show that the maximum frequency of
the first is 1/k that of the other.

8.7 Find graphically or otherwise the point of inflection of the normal
curve, and show that it occurs at a distance $\sigma$ from the mean ordinate.

8.8 Show that if np be a whole number, the mean of the binomial coincides
with the greatest term.

8.9 Show that if two symmetrical binomial distributions of degree n
(and of the same number of observations) are so superposed that the rth
term of the one coincides with the (r+1)th term of the other, the
distribution formed by adding superposed terms is a symmetrical binomial of
degree (n+1).

[Note. It follows that if two normal distributions of the same area and
standard deviation are superposed so that the difference between the
means is small compared with the standard deviation, the compound
curve is very nearly normal.]

8.10 Calculate the ordinates of the binomial $1{,}024\,(0.5+0.5)^{10}$, and
compare them with those of the normal curve.

8.11 If skulls are classified as dolichocephalic when the length-breadth 
index is under 75, mesocephalic when the same index lies between 75 and 80, 
and brachycephalic when the index is over 80, find approximately (assuming 
that the distribution is normal) the mean and standard deviation of a 
series in which 58 per cent are stated to be dolichocephalic, 38 per cent
mesocephalic and 4 per cent brachycephalic. 

8.12 Find the deciles of the normal curve. 

8.13 Write down the normal population which has the same mean and
(uncorrected) standard deviation as that of the last column of Table 4.7,
page 82, and find the mean deviation and quartile deviation. Compare
the results with the corresponding quantities for the actual distribution.

8.14 Proceed similarly for the skew population of Table 4.8, page 84.

8.15 In Exercise 8.4, if 1,000 investigators each choose 100 individuals,
how many would you expect to report that more than 60 persons are
consumers ?

8.16 Taking the population of screws of Table 4.3, page 72, find the normal
population which has the same standard deviation and a mean of 1 inch. 
Compare the frequencies given by this population with the actual 
frequencies. 

8.17 The following data (Lucy Whitaker, Biometrika, 1914, 10, 36) give
the number of deaths of women over 85 published in The Times during
1910-12:

Number of deaths
per day              0     1     2     3     4     5     6     7

Frequency          364   376   218    89    33    13     2     1

Find the frequencies of the Poisson distribution which has the same mean
as this distribution, and compare your results with the actual frequencies.
For the purpose of this example, simple interpolation in the tables given
in Tables for Statisticians and Biometricians is sufficient.

8.18 In the data of the previous exercise calculate the first four

CORRELATION AND REGRESSION


Bivariate populations

9.1 In Chapters 4 to 8 we considered the members of a population 
classified according to the values of a single variable ; and we saw how 
they could be grouped into a frequency-distribution whose character- 
istics could be described by certain constants. We have now to proceed 
to the case of two variables, in which each member of the population will 
exhibit two values, one for each of the variables under consideration. 

A population of this kind is called a bivariate population. One of our 
main topics will be the way in which the two variables are related in the 
population. 

9.2 If the corresponding values of the two variables are noted for each 
member, the methods of classification employed in the previous chapters 
may be applied to both variables. We can thus group our data into a 
table of double entry, or contingency table (Chapter 3), showing the 
frequencies of pairs of values lying within given class-intervals. Six
such tables are given below as illustrations for the following variables :
Table 9.1, two measurements on a shell; Table 9.2, ages of husbands
and their wives in marriages taking place in England and Wales in 1933 ; 
Table 9.3, statures of fathers and their sons ; Table 9.4, age and yield of 
milk in cows ; Table 9.5, the rate of discount and ratio of reserves to 
deposits in American banks; Table 9.6, the birth rate per thousand and 
the total numbers of births in the registration districts of England in 
1941. 

Arrays and correlation tables

9.3 Each row in such a table gives the frequency-distribution of the
first variable for the members of the population in which the second variable
lies within the limits stated at the left of the row. Similarly for the
columns. As "columns" and "rows" are distinguished only by the
accidental circumstances of the one set running vertically and the other
horizontally, and the difference has no statistical significance, the word
array has been suggested as a convenient term to denote either a row or
a column.

If the values of X in one array are associated with values of Y in an
interval centred at $Y_m$, then $Y_m$ is called the type of the array.

9.4 A grouped frequency-distribution of the type of Tables 9.1 to
9.6 may then be termed a bivariate frequency-distribution ; but if we are
particularly interested in the relationship between the two variates it is
sometimes called a correlation table. The difference between a correlation
table and a contingency table lies in the fact that the latter term may
be, and usually is, applied to tables classified according to unmeasured
quantities or imperfectly defined intervals.

9.5 We need add very little to what was said in Chapter 4 about the 
choice and magnitude of class-intervals and the classification of data. 
When the intervals have been fixed, the table is readily compiled from the 
raw material by taking a large sheet of paper ruled with arrays properly 

TABLE 9.2. Correlation between ages of (1) husband and (2) wife in marriages in
England and Wales in 1933

Figures in hundreds; certain marriages in which no age was specified are omitted.
(Data from Registrar-General's Statistical Review of England and Wales for 1933, Tables, Part II, Civil)

(2) Age of                            (1) Age of husband (years)
wife
(years)     15-    20-    25-   30-   35-   40-   45-   50-   55-   60-   65-   70-   75-   Total

15-          33    189     56     8     2     -     -     -     -     -     -     -     -     288
20-          18    682    585   106    19     5     2     1     -     -     -     -     -   1,418
25-           1    140    511   179    40    14     6     3     1     1     -     -     -     896
30-           -     11     75   101    42    20    10     5     2     1     1     -     -     268
35-           -      2     10    24    28    19    13     8     5     2     1     -     -     112
40-           -      -      1     5     9    14    12    10     6     4     2     1     -      64
45-           -      -      -     1     3     5     9     9     7     4     3     1     -      42
50-           -      -      -     -     -     1     3     7     6     5     3     1     -      26
55-           -      -      -     -     -     -     1     3     5     4     3     1     -      17
60-           -      -      -     -     -     -     -     1     1     4     3     2     -      11
65-           -      -      -     -     -     -     -     -     1     1     3     2     1       8
70-           -      -      -     -     -     -     -     -     -     -     1     1     1       3

Total        52  1,024  1,238   424   143    78    56    47    34    26    20     9     2   3,153


headed in the same way as the final table and entering a small mark in
the compartment corresponding to the variate values exhibited by each
individual. If facility of checking be of great importance, each pair of
recorded values may be entered on a separate card and these dealt into
little packs on a board ruled in squares, or into a divided tray ; each pack
can then be run through to see that no card has been mis-sorted. The
difficulty as to the intermediate observations (values of the variables
corresponding to divisions between class-intervals) will be met in the same
way as before if the value of one variable alone be intermediate, the unit
of frequency being divided between two adjacent compartments. If both
values of the pair be intermediates, the observation must be divided
between four adjacent compartments, and thus quarters as well as halves
may occur in the table, as for example, in Table 9.3. In this case the
statures of fathers and sons were measured to the nearest quarter-inch
and subsequently grouped by 1-inch intervals ; a pair in which the recorded
stature of the father is 60.5 in. and that of the son 62.5 in. is accordingly
entered as 0.25 to each of the four compartments under the columns
59.5-60.5, 60.5-61.5, and the rows 61.5-62.5, 62.5-63.5.
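
The rule for intermediate observations is easily mechanised. The sketch below is an illustration only: the function names, and the choice of class-intervals of unit width with divisions at the half-inches, are assumptions made to match the example of Table 9.3. Each observation contributes one unit of frequency, shared equally between neighbouring compartments whenever a value falls exactly on a division.

```python
from collections import defaultdict
from fractions import Fraction
import math

def cells_for(value, width=1.0, offset=0.5):
    """Class-intervals (identified by lower limit) that receive a value,
    with the share of the unit of frequency going to each.

    Intervals run from offset + k*width to offset + (k+1)*width; a value
    lying exactly on a division is shared equally between the two
    neighbouring intervals. Exact comparison is safe here because
    quarter-inch readings are represented exactly in binary.
    """
    position = (value - offset) / width
    k = math.floor(position)
    if position == k:                       # intermediate observation
        return [(offset + (k - 1) * width, Fraction(1, 2)),
                (offset + k * width, Fraction(1, 2))]
    return [(offset + k * width, Fraction(1))]

def correlation_table(pairs, width=1.0, offset=0.5):
    """Frequencies keyed by (lower limit for X, lower limit for Y)."""
    table = defaultdict(Fraction)
    for x, y in pairs:
        for x_lo, x_share in cells_for(x, width, offset):
            for y_lo, y_share in cells_for(y, width, offset):
                table[(x_lo, y_lo)] += x_share * y_share
    return dict(table)

# Father 60.5 in., son 62.5 in.: both values are intermediate.
table = correlation_table([(60.5, 62.5)])
for cell, share in sorted(table.items()):
    print(cell, share)
```

Fractions are used for the shares so that halves and quarters accumulate without rounding, and the total frequency of the table remains equal to the number of observations.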

Frequency-surface and stereogram

9.6 The distribution of frequency for two variables may be represented
by a surface in three dimensions in the same way as the frequency- 
distribution for a single variable may be represented by a curve in two. 
We may imagine the surface to be obtained by erecting at the centre of 
every compartment of the correlation table a vertical of length proportion- 
ate to the frequency in that compartment, and joining up the tops of the 
verticals. If the compartments were made smaller and smaller while the 
class-frequencies remained finite, the irregular figure so obtained would 
approximate more and more closely towards a continuous curved surface 
— a frequency-surface — corresponding to the frequency-curves for single 
variables of Chapter 4. The volume of the frequency-solid over any area 
drawn on its base gives the frequency of pairs of values falling within that 
area, just as the area of the frequency-curve over an interval of the base 
line gives the frequency of observations within that interval.

9.7 Similarly, a figure analogous to the frequency-polygon or the 
histogram may be constructed by drawing the frequency-distributions for 
all arrays of the one variable, to the same scale, on sheets of cardboard, 
cutting-out and erecting the cards vertically on a base-board at equal 
distances apart, or by marking out a base-board in squares corresponding
to the compartments of the correlation table, and erecting on each square
a rod of wood of height proportionate to the frequency. Such solid
representations of frequency-distributions for two variables are sometimes
termed stereograms.

9.8 It is impossible, however, to group the majority of
frequency-surfaces, in the same way as the frequency-curves, under a few simple
types: the forms are too varied. The simplest ideal type is one in which
every section of the surface is a symmetrical curve, the first type of
Chapter 4, fig. 4.5, page 81. Like the symmetrical distribution for the
single variable, this is a very rare form of distribution in economic statistics,
but approximate illustrations may be drawn from anthropometry. Fig.
9.1 shows the ideal form of the surface, somewhat truncated, and fig. 9.3
the distribution of Table 9.3, which approximates to the same type;
the difference in steepness is, of course, merely a matter of scale. The
maximum frequency occurs in the centre of the whole distribution, and
the surface is symmetrical round the vertical through the maximum, equal
frequencies occurring at equal distances from the mode on opposite



TABLE 9.4. Correlation between (1) age in years and (2) yield of milk per week in 4,912 Ayrshire cows

(Data from J. F. Tocher, "An Investigation of the Milk Yield of Dairy Cows," Biometrika, 1928, 20B, 108)

TABLE 9.5. Correlation between (1) call discount rates and (2) percentage of reserves on deposits in New York Associated Banks
(Weekly Returns)

(Data from Statistical Studies in the New York Money Market, by J. P. Norton, Publications of the Department of the Social Sciences, Yale University; The Macmillan Co., 1902)
Note that, after the column headed 8 per cent, blank columns have been omitted to save space.

TABLE 9.7. Showing the monthly index-numbers of prices of (1) animal feeding-stuffs
and (2) home-grown oats in England and Wales for 1931-1935

The index-numbers are based on prices in corresponding months of 1911-1913
(Data from Agricultural Market Report for England and Wales)

               Index of          Index of                       Index of          Index of
Month          feeding-stuffs    oats            Month          feeding-stuffs    oats
               price             price                          price             price

1931 Jan.          78               84           1933 July          85               75
     Feb.          77               82                Aug.          83               79
     Mar.          85               82                Sept.         80               78
     Apr.          88               85                Oct.          78               78
     May           87               89                Nov.          80               76
     June          82               90                Dec.          83               75
     July          81               88
     Aug.          77               92           1934 Jan.          82               80
     Sept.         76               83                Feb.          83               91
     Oct.          83               89                Mar.          85               87
     Nov.          97               98                Apr.          83               84
     Dec.          93               99                May           82               81
                                                      June          85               83
1932 Jan.          95              102                July          88               83
     Feb.          97              102                Aug.         101               92
     Mar.         102              105                Sept.        102               98
     Apr.          99              105                Oct.          98               94
     May           97              107                Nov.          96               94
     June          94              107                Dec.          98               95
     July          94              101
     Aug.          97              106           1935 Jan.          98              100
     Sept.         92               96                Feb.          92               99
     Oct.          89               90                Mar.          92               96
     Nov.          90               85                Apr.          90               98
     Dec.          90               81                May           88               97
                                                      June          86               98
1933 Jan.          92               84                July          83               99
     Feb.          91               85                Aug.          80               92
     Mar.          90               84                Sept.         81               90
     Apr.          86               81                Oct.          86               89
     May           85               76                Nov.          83               87
     June          85               77                Dec.          82               83


The next simplest type of surface corresponds to the second type of
frequency-curve, the moderately asymmetrical. Most, if not all, of the
distributions of arrays are asymmetrical and like the distributions of fig.
4.7 ; the surface is consequently asymmetrical, and the maximum does
not lie in the centre of the distribution. This form is fairly common, and
illustrations might be drawn from a variety of sources: economics,
meteorology, anthropometry, etc. The data of Table 9.4 will serve as an
example. The total distributions and the distributions of the majority
of the arrays are asymmetrical, the rows being markedly so. The maximum
frequency lies towards the upper end of the table in the compartment
under the row headed "16" and column headed "4". The frequency
falls off very rapidly towards the lower ages, and slowly in the direction
of old age.

Apart from these two forms, it seems impossible to delimit empirically
any simple types. Tables 9.5 and 9.6 are given as illustrations of
two very divergent forms. Fig. 9.2 gives a graphical representation of the
former by the method corresponding to the histogram of Chapter 4, the
frequency in each compartment being represented by a square pillar. The
distribution of frequency is very characteristic, and quite different from
that of any of the Tables 9.1 to 9.4.

The scatter diagram


9.9 There is another method of representing bivariate data graphically
which is particularly useful for ungrouped data. Take, for instance,
the data of Table 9.7, giving the index-numbers of prices of animal
feeding-stuffs and home-grown oats for each month of the years 1931-35. There
are only 60 pairs of values, and the data cannot be grouped into a
frequency-distribution with class-intervals of reasonable size without
giving rise to irregular frequencies. We may, however, proceed as
follows.

On squared paper take two axes at right angles, one axis corresponding
to the variable X and the other to the variable Y (see fig. 9.4). To each
member of the population there will correspond a pair of values X, Y, which
in turn will correspond to a point whose abscissa on the diagram is X and
whose ordinate is Y. Thus the population, when represented in this way,
will give a swarm of points on the diagram, and we can interpret the ways
in which these points cluster or scatter as properties of the relationship
between the two variables. Fig. 9.4 shows the data of Table 9.7 plotted
in this way. It will be observed that the points tend to distribute
themselves so that high and low values of X correspond to high and low values
of Y respectively.

Fig. 9.4. Scatter diagram of index-numbers of prices of (1) animal feeding-stuffs and
(2) home-grown oats (Table 9.7). For the meaning of the straight lines, see Example 9.1.

Such a figure is called a scatter diagram. 
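
The construction can be imitated roughly in a few lines of Python. The sketch below marks each pair (X, Y) on a coarse character grid; the function name `text_scatter`, the grid size and the plotting ranges are assumptions made for this illustration only, and the six pairs are the values for January to June 1931 from Table 9.7.

```python
def text_scatter(points, x_range, y_range, cols=30, rows=10):
    """Return text lines with '*' wherever at least one (x, y) pair falls.

    Assumes every point lies inside x_range and y_range.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        # Floor division keeps the cell index exact for integer data.
        col = min(cols - 1, int((x - x_lo) * cols // (x_hi - x_lo)))
        row = min(rows - 1, int((y - y_lo) * rows // (y_hi - y_lo)))
        grid[rows - 1 - row][col] = "*"   # line 0 is printed at the top
    return ["".join(line) for line in grid]


# January to June 1931: (feeding-stuffs index X, oats index Y).
pairs = [(78, 84), (77, 82), (85, 82), (88, 85), (87, 89), (82, 90)]
lines = text_scatter(pairs, x_range=(75, 105), y_range=(70, 110))
print("\n".join(lines))
```

With all 60 pairs supplied, the stars drift upwards from left to right, which is the tendency described above.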

9.10 We can also represent a grouped bivariate frequency table on
a scatter diagram, though less satisfactorily and with some labour. For
this purpose axes are taken as before and abscissae and ordinates drawn to
correspond to the divisions of the frequency table. The diagram will then
be divided into compartments corresponding to the compartments of the
table. In each compartment we place a number of dots equal to the
frequency in the corresponding compartment of the table. We have, as a
rule, no guide as to the disposition of these dots within their respective
cells, and hence it is usual to place them in some symmetrical arrangement
so that they are, as nearly as may be, spread uniformly through the cells.

The difficulty of inserting the dots when the frequencies are large will
be obvious, and, in fact, such a scatter diagram rarely tells us more than we
can from an inspection of the table itself. In contrast to this, the
scatter diagram of the data of Table 9.7 gives a much better picture of the
dependence of the two variates than can be obtained by mere inspection
of the ungrouped data of the table.

9.11 It is clear that a correlation table may be treated by the methods
discussed in Chapter 3, which are applicable to all contingency tables,
however formed. But the coefficient of contingency merely tells us
whether two variables are related, and if so, how closely. The methods
we shall now discuss go much further than this. The numerical character
of the variates and the arrangement of the correlation table in
class-intervals of equal widths enable us to approach the problem of
investigating the relationship between the variates with additional precision.

9.12 If the two variates in a contingency table are independent, the
distributions in parallel arrays are similar (3.18); hence their averages
and dispersions, i.e. their means and standard deviations, must be the same.
In general they will not be the same, and we are thus led to inquire into the
relation between the values of the means and standard deviations in
different arrays and the departure of the distribution from complete
independence.

9.13 The mean is the most important constant, in general, and for
the present we shall concentrate our attention upon it. Although the
values in arrays are scattered about their respective means, it is in most
cases profitable to inquire how the means of arrays are related; this will
throw a good deal of light on the important question whether high values
of one variate show any tendency to be associated, on the average, with high
values of the other variate.

If possible, we also wish to know how great a divergence of one variate 
from its mean is associated with a given divergence of the other, and to 
obtain some idea of how closely the relation is usually fulfilled. 

Lines of regression 

9.14 Let us then consider the means of arrays. Let OX, OY be two
axes at right angles representing the scales of the two variates. As in
the case of the scatter diagram we can plot the positions of the means; for
example, if the mean of a row whose variate value is centred at $y_1$ is $x_1$,
we can plot the point whose abscissa is $x_1$ and whose ordinate is $y_1$. There
will thus be one point corresponding to each row and one to each column.
In practice, to distinguish the two, the means of rows are denoted by small
circles and the means of columns by small crosses. Fig. 9.8 shows such
a diagram drawn for the data of Table 9.3.

The means of rows and the means of columns will, in general, lie more
or less closely round smooth curves. For example, in fig. 9.8 they lie,
very approximately, on straight lines, RR and CC in the figure. Such
curves are said to be curves of regression, and their equations with reference
to the axes OX and OY are called regression equations. If the lines of
regression are straight, the regression is said to be linear. In the contrary
case it is said to be curvilinear.
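
As an illustration of how the two sets of points are obtained, the following Python sketch computes the means of rows and of columns from a small correlation table. The table `toy`, the function name `array_means` and the convention that a row is the array of x-values belonging to one value of y are assumptions made for this sketch; the frequencies are invented and are not those of Table 9.3.

```python
from collections import defaultdict


def array_means(table):
    """table maps (x, y) class mid-values to frequencies.

    Returns (row_means, col_means): row_means[y] is the mean x of the row
    centred at y, and col_means[x] is the mean y of the column centred at x.
    """
    row_n, row_sum_x = defaultdict(float), defaultdict(float)
    col_n, col_sum_y = defaultdict(float), defaultdict(float)
    for (x, y), f in table.items():
        row_n[y] += f
        row_sum_x[y] += f * x
        col_n[x] += f
        col_sum_y[x] += f * y
    row_means = {y: row_sum_x[y] / row_n[y] for y in row_n}
    col_means = {x: col_sum_y[x] / col_n[x] for x in col_n}
    return row_means, col_means


# Invented frequencies, for illustration only.
toy = {(1, 1): 4, (2, 1): 2, (2, 2): 3, (3, 2): 3, (3, 3): 2, (4, 3): 4}
row_means, col_means = array_means(toy)

for y, mean_x in sorted(row_means.items()):
    print("row y =", y, "mean x =", round(mean_x, 3))      # the small circles
for x, mean_y in sorted(col_means.items()):
    print("column x =", x, "mean y =", round(mean_y, 3))   # the small crosses
```

Plotting the pairs (mean x, y) and (x, mean y) on the same axes gives the circles and crosses of a diagram such as fig. 9.8.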

9.15 The term "regression" is not a particularly happy one from
the etymological point of view, but it is so firmly embedded in statistical
literature that we make no attempt to replace it by an expression which
would more suitably express its essential properties. It was introduced by
Galton in connection with the inheritance of stature. Galton found that
the sons of fathers who deviate x inches from the mean height of all fathers
themselves deviate from the mean height of all sons by less than x inches,
i.e. there is what Galton called a "regression to mediocrity." In general
the idea ordinarily attached to the word "regression" does not touch
upon this connotation, and it should be regarded merely as a convenient
term.

9.16 If two variates are independent, their regression lines are straight
and at right angles, the means of rows lying on a line parallel to the
axis OY and the means of columns on a line parallel to the axis OX,
for the distributions in parallel arrays are similar (see fig. 9.5). In any
case drawn from actual data, of course, the means might not lie exactly on
straight lines, owing to fluctuations of sampling.

9.17 The cases with which the experimentalist, e.g. the chemist or
physicist, has to deal, where the observations are all crowded closely
round a single line, lie at the opposite extreme from independence. The
entries fall into a few compartments only of each array, and the means of
rows and of columns lie approximately on one and the same curve, like
the line RR of fig. 9.6.

9.18 The ordinary cases of statistics are intermediate between these
two extremes, the lines of means being neither perpendicular as in fig. 9.5,
nor coincident as in fig. 9.6. One problem of the statistician is to find
expressions which will suffice to describe the regression lines, either exactly
or to a satisfactory degree of approximation.

In general this is a difficult problem, and the theory of curvilinear
regression is as yet incomplete. We can, however, make considerable
progress by confining ourselves to the cases in which the regression is linear.
Cases of this kind are more frequent than might be supposed, and in other
cases the means of arrays lie so irregularly, owing to the paucity of the
observations, that the real nature of the regression curve is not indicated
and a straight line will give as good an approximation as a more elaborate
curve.

9.19 Consider the simplest case in which the means of rows lie exactly
on a straight line RR (fig. 9.7). Let $M_2$ be the mean value of Y, and
let RR cut $M_2x$, the horizontal through $M_2$, in M. Then it may be
shown that the vertical through M must cut OX in $M_1$, the mean of X.
For, let the slope of RR to the vertical, i.e. the tangent of the angle $M_1MR$
or ratio of kl to lM, be $b_1$, and let deviations from My, Mx be denoted by x
and y.

Fig. 9.5

Fig. 9.7

Then for any one row of type y in which the number of observations
is n, $\Sigma(x) = n b_1 y$, and therefore for the whole table, since $\Sigma(ny) = 0$,
$\Sigma(x) = b_1\Sigma(ny) = 0$. $M_1$ must therefore be the mean of X, and M may
accordingly be termed the mean of the whole distribution.

Knowing that RR passes through the mean of the distribution, we can
determine it completely if we know the value of $b_1$.

For any one row we have

$$\Sigma(xy) = y\,\Sigma(x) = n b_1 y^2$$

Therefore for the whole table

$$\Sigma(xy) = b_1\Sigma(ny^2) = N b_1 \sigma_y^2$$

Let us write

$$p = \frac{1}{N}\Sigma(xy) \tag{9.1}$$

Then

$$b_1 = \frac{p}{\sigma_y^2} \tag{9.2}$$

Similarly, if CC be the line on which lie the means of columns and $b_2$ is
the slope to the horizontal,

$$b_2 = \frac{p}{\sigma_x^2} \tag{9.3}$$

Now let us define

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} \tag{9.4}$$

Then

$$b_1 = r\frac{\sigma_x}{\sigma_y} \quad\text{and}\quad b_2 = r\frac{\sigma_y}{\sigma_x} \tag{9.5}$$

and the equations of RR and CC, referred to the centre of the distribution,
are

$$x = r\frac{\sigma_x}{\sigma_y}\,y \quad\text{and}\quad y = r\frac{\sigma_y}{\sigma_x}\,x \tag{9.6}$$

and, referred to the origin O,

$$X - M_1 = r\frac{\sigma_x}{\sigma_y}(Y - M_2) \quad\text{and}\quad Y - M_2 = r\frac{\sigma_y}{\sigma_x}(X - M_1) \tag{9.7}$$

9.20 Let us now proceed to the case when the means of arrays are not
situated on a straight line. This we shall treat by finding the next best
thing: straight lines which are the closest fit to the means.

The expression "closest fit," as applied to the fitting of curves to points,
is one which we deal with at length in Chapter 15, and it is only necessary
to say at this stage that the straight line RR of closest fit to the means of
rows, i.e.

$$x = a_1 + b_1 y$$

will be determined by evaluating $a_1$ and $b_1$ so as to make the expression

$$E = \Sigma\{x - (a_1 + b_1 y)\}^2$$

(that is, the sum of the squares of the horizontal distances of the points
representing the observations from RR) a minimum. Here x and y,
as before, denote deviations from the respective means of X and Y, and
the summation is taken over all values of x and y.

We have, expanding E,

$$E = \Sigma(a_1^2) - 2\Sigma\{a_1(x - b_1 y)\} + \Sigma(x - b_1 y)^2$$

The second term on the right vanishes, since $\Sigma(x) = \Sigma(y) = 0$, and hence

$$E = \Sigma(a_1^2) + \Sigma(x - b_1 y)^2$$

Now $a_1$ and $b_1$ can be chosen independently, and hence E is a minimum
only if $\Sigma(a_1^2) = 0$, i.e.

$$a_1 = 0 \tag{9.8}$$

Thus the line of closest fit goes through the mean of the distribution.
Hence,

$$\begin{aligned}
E &= \Sigma(x - b_1 y)^2 \\
  &= \Sigma(x^2) - 2b_1\Sigma(xy) + b_1^2\Sigma(y^2) \\
  &= \Sigma(y^2)\left\{b_1 - \frac{\Sigma(xy)}{\Sigma(y^2)}\right\}^2
     + \Sigma(x^2)\left\{1 - \frac{\{\Sigma(xy)\}^2}{\Sigma(x^2)\,\Sigma(y^2)}\right\}
\end{aligned} \tag{9.9}$$

This is a minimum when the first term (a square) is zero, i.e. when

$$b_1 = \frac{\Sigma(xy)}{\Sigma(y^2)} \tag{9.10}$$

which is the same as equation (9.2).

We may show similarly that the line of closest fit CC, given by

$$y = a_2 + b_2 x$$

has

$$a_2 = 0, \qquad b_2 = \frac{\Sigma(xy)}{\Sigma(x^2)}$$

which is the same as equation (9.3).

If we regard the equation

$$x = a_1 + b_1 y$$

as one for estimating x from y, we may take $x - a_1 - b_1 y$ as the error of
estimation, and E will then be the sum of the squares of such errors. The
condition that E is a minimum is then equivalent to the condition that the
sum of squares of errors of estimation shall be a minimum. This is one
form of the so-called "Principle of Least Squares" (see Chapter 15).
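
The minimum property can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the five pairs of observations are invented, and the name `sum_sq_error` is introduced here for the expression E. It evaluates E at the values given by (9.8) and (9.10) and at a few neighbouring values of $a_1$ and $b_1$.

```python
def sum_sq_error(xs, ys, a1, b1):
    """E = sum of {x - (a1 + b1*y)}^2 over all pairs of deviations."""
    return sum((x - (a1 + b1 * y)) ** 2 for x, y in zip(xs, ys))


# Invented observations, chosen only for illustration.
X = [1.0, 2.0, 4.0, 5.0, 8.0]
Y = [2.0, 1.0, 5.0, 4.0, 8.0]
mean_x, mean_y = sum(X) / len(X), sum(Y) / len(Y)
xs = [v - mean_x for v in X]          # deviations x
ys = [v - mean_y for v in Y]          # deviations y

b1 = sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)   # (9.10)
e_min = sum_sq_error(xs, ys, 0.0, b1)                               # a1 = 0, (9.8)

# Every other choice of a1 and b1 on this small grid gives a larger E.
others = [sum_sq_error(xs, ys, a, b1 + d)
          for a in (-0.5, 0.0, 0.5) for d in (-0.2, 0.0, 0.2)
          if (a, d) != (0.0, 0.0)]
print(b1, e_min, min(others))
```

A grid search of this kind proves nothing in general, but it shows the sense in which the values (9.8) and (9.10) are "best" for the data in hand.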

9.21 Equations (9.6) and (9.7) are thus of general application. If the
regression is exactly linear they give the lines of regression. If the
regression departs from linearity, either owing to sampling effects or owing
to real divergences, they give the "best" straight regression lines which
the data admit. We may regard the equations as either (a) equations for
estimating an individual x from its associated y (or y from its associated x)
in such a way that the sum of squares of errors of estimation is a minimum;
or (b) equations for estimating the mean of the x's associated with a
particular y (or the mean of y's associated with a particular x) in such a
way that the sum of the squares of errors of estimation is a minimum,
each mean being counted proportionately to the number of observations
on which it is based.





Coefficient of correlation

9.22 The coefficient r defined in equation (9.4) is of very great importance.
It is called the coefficient of correlation.
r cannot exceed +1 or be less than -1.

For, from equation (9.9) we see that, when $b_1$ has the value (9.10), the value of E is

$$E = \Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2) \tag{9.11}$$

But E is the sum of a number of squares and cannot be negative.
Hence,

$$1 - r^2 \geq 0$$

which proves the result.

If r = +1, the regression equations are identical, as may be seen from
equations (9.6), and hence the lines RR and CC coincide. In this case it
follows from (9.11) that for all pairs of values of the variates

$$x - b_1 y = 0$$

i.e. all values lie on a single straight line. Thus to one value of x there
corresponds one, and only one, value of y. This is the case we mentioned
in 9.17, and since high values of x correspond to high values of y, the
variables may be said to be perfectly positively correlated.

Fig. 9.8. Correlation between stature of father and stature of son (Table 9.3).
Means of rows shown by circles and means of columns by crosses: r = +0.51

Similarly, if r = -1, the pairs of values all lie on a single straight line as
before, but high values of one will be associated with low values of the
other. In this case we can say that the variates are perfectly negatively
correlated.

Fig. 9.9. Means of rows shown by circles and means of columns by crosses: r = +0.22

Finally, if the variates are independent, r is zero, for $b_1$ and $b_2$ are zero,
and the lines of regression are parallel to OX and OY. It does not follow,
however, that if r is zero the variates are independent; the fact that r is
zero implies only that the means of arrays lie scattered around two straight
lines which do not exhibit any definite trend away from the horizontal or
the vertical as the case may be. Two variates for which r is zero may,
however, be spoken of as uncorrelated. Table 9.6 will serve as a case
where the variates are almost uncorrelated but by no means independent,
r being small (0.17) (see fig. 9.10), but the coefficient of contingency C
(for the grouping of Exercise 9.3) 0.30. Figs. 9.8 and 9.9 are drawn from
the data of Tables 9.3 and 9.4, for which r has the values +0.51 and
+0.22 respectively. The student should study such tables and diagrams
closely, and endeavour to accustom himself to estimating the value of r
from the general appearance of the table.

It does not follow that if x and y are functionally related their correlation
is unity, unless the relationship is linear. Cf. Exercise 10.9.
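
A small numerical case may make the last remark concrete. In the Python sketch below (the helper name `correlation` and the seven values of X are assumptions chosen for illustration) Y is an exact function of X in both cases, yet the coefficient is unity only for the linear relation; for $Y = X^2$, with X symmetrical about zero, it vanishes.

```python
from math import sqrt


def correlation(X, Y):
    """Product-moment correlation, r = sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    sxx = sum((x - mean_x) ** 2 for x in X)
    syy = sum((y - mean_y) ** 2 for y in Y)
    return sxy / sqrt(sxx * syy)


X = [-3, -2, -1, 0, 1, 2, 3]
r_linear = correlation(X, [2 * x + 5 for x in X])   # exact linear relation
r_square = correlation(X, [x * x for x in X])       # exact, but not linear
print(r_linear, r_square)
```

The second pair of variates is therefore uncorrelated in the sense defined above, although by no means independent.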


Coefficients of regression

9.23 The two quantities

$$b_1 = r\frac{\sigma_x}{\sigma_y}, \qquad b_2 = r\frac{\sigma_y}{\sigma_x}$$

are called coefficients of regression, $b_1$ being the regression of x on y, or
deviation in x corresponding on the average to a unit change in y, and $b_2$
being similarly the regression of y on x.

The coefficient of correlation is always a pure number, but the coefficients
of regression are only pure numbers if the variates are the same in kind;
for they depend on the ratio $\sigma_x/\sigma_y$, and consequently on the units in which
X and Y are measured.

Since r is not numerically greater than unity, and $b_1 b_2 = r^2$, at least one of
the coefficients of regression is not numerically greater than unity; but the
other may be greater than unity, if $\sigma_x/\sigma_y$ or $\sigma_y/\sigma_x$ be large.

9.24 The two standard deviations,

$$s_x = \sigma_x\sqrt{1 - r^2}, \qquad s_y = \sigma_y\sqrt{1 - r^2}$$

are of considerable importance. It follows from (9.11) that $s_x$ is the
standard deviation of $(x - b_1 y)$, and similarly $s_y$ is the standard deviation
of $(y - b_2 x)$. Hence we may regard $s_x$ and $s_y$ as the standard errors
(root-mean-square errors) made in estimating x from y and y from x by the
respective regression equations

$$x = b_1 y, \qquad y = b_2 x$$

$s_x$ may also be regarded as a kind of average standard deviation of a row
about RR, and $s_y$ as an average standard deviation of a column about CC.
In an ideal case, where the regression is truly linear and the standard
deviations of all parallel arrays are equal, a case to which the distribution
of Table 9.3 is a rough approximation,* $s_x$ is the standard deviation of the
x-array and $s_y$ the standard deviation of the y-array. Hence $s_x$ and $s_y$ are
sometimes termed the "standard deviations of arrays."
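
The identity between $s_x$ and the root-mean-square error of estimation is easily verified on figures. The following Python sketch uses five invented pairs of observations (an assumption made purely for illustration) and compares $\sigma_x\sqrt{1-r^2}$ with the root-mean-square of the residuals $x - b_1 y$.

```python
from math import sqrt

# Invented observations, chosen only for illustration.
X = [1.0, 2.0, 4.0, 5.0, 8.0]
Y = [2.0, 1.0, 5.0, 4.0, 8.0]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
xs = [v - mean_x for v in X]
ys = [v - mean_y for v in Y]

sigma_x = sqrt(sum(x * x for x in xs) / n)
sigma_y = sqrt(sum(y * y for y in ys) / n)
p = sum(x * y for x, y in zip(xs, ys)) / n      # equation (9.1)
r = p / (sigma_x * sigma_y)                     # equation (9.4)
b1 = p / sigma_y ** 2                           # equation (9.2)

s_x = sigma_x * sqrt(1 - r * r)
rms_error = sqrt(sum((x - b1 * y) ** 2 for x, y in zip(xs, ys)) / n)
print(s_x, rms_error)
```

The two printed values agree to within rounding, as (9.11) requires.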


* Tables in which the standard deviations of arrays are equal are sometimes said
to be "homoscedastic"; in the contrary case "heteroscedastic."

Calculation of the coefficient of correlation

9.25 We now proceed to the arithmetical work involved in calculating
the correlation coefficient.

For this purpose we use the formula (9.4), i.e.

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}}$$

The calculation of $\Sigma(x^2)$, or $\sigma_x$, and of $\Sigma(y^2)$, or $\sigma_y$, proceeds exactly
as in Chapter 6. The only expression of a novel type is the quantity
$\frac{1}{N}\Sigma(xy)$, which we may call the first product-moment or the covariance
of the distribution.* As in the case of univariate distributions, the form
of the arithmetic is slightly different according as the observations are
grouped or ungrouped.

* In generalisation of the definition of moments of a univariate distribution in
Chapter 7 we may define the product-moments of a bivariate population as

$$\mu_{jk} = \frac{1}{N}\Sigma(f x^j y^k)$$

where f is the frequency and the variates are measured from their means. This gives us
$\mu_{11}$ as the quantity we have called p in equation (9.1).

9.26 Our work is greatly simplified by the use of devices similar to those
employed in calculating the means and other moments of univariate
distributions.

(a) We take working means for the two variates, obtained by inspection,
and transfer our moments to those about the means after the bulk of
the arithmetic has been performed. For the first product-moment we
have, in fact, if $\xi$, $\eta$ are the deviations from the working means and
$\bar\xi$, $\bar\eta$ the deviations of the true means from the working means,

$$\xi = x + \bar\xi, \qquad \eta = y + \bar\eta$$

Hence,

$$\xi\eta = xy + \bar\xi y + x\bar\eta + \bar\xi\bar\eta$$

Summing for all members of the population, since $\Sigma(\bar\xi y) = \bar\xi\,\Sigma(y) = 0$ and
similarly $\Sigma(x\bar\eta) = 0$, x and y being deviations from the true means,

$$\Sigma(\xi\eta) = \Sigma(xy) + N\bar\xi\bar\eta$$

Hence,

$$\Sigma(xy) = \Sigma(\xi\eta) - N\bar\xi\bar\eta \tag{9.12}$$

This gives us the product-moment about the true means in terms of
the product-moment about the working means and the deviations of the
true means from the working means.

(b) As a check on the rather heavy arithmetic which is frequently
involved, it is advisable to use a method similar to that of 6.11. We have

$$\Sigma(\xi+1)(\eta+1) = \Sigma(\xi\eta) + \Sigma(\xi) + \Sigma(\eta) + N \tag{9.13}$$

If, therefore, we calculate $\Sigma(\xi+1)(\eta+1)$ as well as $\Sigma(\xi\eta)$, we shall have in
the above equation a check on the accuracy of our work.

(c) We take the class-intervals as units and transfer to other units
afterwards as desired.
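
Devices (a) and (b) can be tried on a handful of figures before they are applied to a whole table. The Python sketch below is an illustration only: it takes the first six pairs of Table 9.7 (January to June 1931) with working means at 90, and the variable names are assumptions of the sketch.

```python
# First six pairs of Table 9.7 (January to June 1931), working means at 90.
X = [78, 77, 85, 88, 87, 82]
Y = [84, 82, 82, 85, 89, 90]
N = len(X)
xi = [x - 90 for x in X]
eta = [y - 90 for y in Y]
xi_bar, eta_bar = sum(xi) / N, sum(eta) / N

# (a) Product-sum about the true means from that about the working means.
sum_xi_eta = sum(a * b for a, b in zip(xi, eta))
sum_xy = sum_xi_eta - N * xi_bar * eta_bar            # equation (9.12)

# The same quantity found directly from deviations about the true means.
mean_x, mean_y = sum(X) / N, sum(Y) / N
sum_xy_direct = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

# (b) The check of equation (9.13).
check_lhs = sum((a + 1) * (b + 1) for a, b in zip(xi, eta))
check_rhs = sum_xi_eta + sum(xi) + sum(eta) + N
print(sum_xy, sum_xy_direct, check_lhs, check_rhs)
```

An arithmetical slip in any single product would, as a rule, disturb the agreement between the two sides of (9.13), which is what makes the check worth its extra column.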

Example 9.1, Table 9.8. Let us investigate the correlation and
regressions of the variates of Table 9.7, the data of which are ungrouped.
The variates are (1) the price index-number of animal feeding-stuffs, X, 
and (2) the price index-number of home-grown oats, Y. The values of 
the variates themselves are shown in columns 2 and 3 of Table 9.8. We 
take a working mean at X =90 and Y =90, and the deviations from these 
values are shown in columns 4 and 5. The remaining columns 6 to 13 
give the squares and product of the deviations together with the various 
auxiliary quantities used for checking purposes. Finally, the various 
sums are shown at the bottom of the table. 

In practice it is as well to show the negative values which may occur in 
columns 4, 5, 6, 7, 12 and 13 (particularly the last two) in a separate column, 
so as to facilitate addition and avoid mistakes. We have refrained from 
this course for convenience of printing. 

As check on the arithmetic we have:

$$-118 = \Sigma(\xi) = \Sigma(\xi+1) - N = -58 - 60$$

$$2{,}924 = \Sigma(\xi+1)^2 = \Sigma(\xi^2) + 2\Sigma(\xi) + N = 3{,}100 - 236 + 60$$

etc., and

$$2{,}493 = \Sigma(\xi+1)(\eta+1) = \Sigma(\xi\eta) + \Sigma(\xi) + \Sigma(\eta) + N = 2{,}565 - 118 - 14 + 60 = 2{,}493$$

We have, then, about the working means:

$$\bar\xi = -\frac{118}{60} = -1.9667, \qquad \bar\eta = -\frac{14}{60} = -0.2333$$

$$\sigma_x^2 = \frac{3{,}100}{60} - (1.9667)^2 = 47.7989, \qquad \sigma_x = 6.914$$

$$\sigma_y^2 = \frac{4{,}814}{60} - (0.2333)^2 = 80.1789, \qquad \sigma_y = 8.954$$

$$p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(\xi\eta)}{N} - \bar\xi\bar\eta = 42.75 - 0.4589 = 42.2911$$

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{42.2911}{61.9080} = +0.68$$

TABLE 9.8. Correlation between monthly index-numbers of prices of (1) animal
feeding-stuffs and (2) home-grown oats in years 1931-35

    1        2    3    4    5    6    7     8      9    10     11    12         13
Month        X    Y    ξ    η  ξ+1  η+1    ξ² (ξ+1)²    η² (η+1)²    ξη (ξ+1)(η+1)
1931 Jan.   78   84  -12   -6  -11   -5   144    121    36     25    72         55
     Feb.   77   82  -13   -8  -12   -7   169    144    64     49   104         84
     Mar.   85   82   -5   -8   -4   -7    25     16    64     49    40         28
     Apr.   88   85   -2   -5   -1   -4     4      1    25     16    10          4
     May    87   89   -3   -1   -2    0     9      4     1      0     3          0
     June   82   90   -8    0   -7    1    64     49     0      1     0         -7
     July   81   88   -9   -2   -8   -1    81     64     4      1    18          8
     Aug.   77   92  -13    2  -12    3   169    144     4      9   -26        -36
     Sept.  76   83  -14   -7  -13   -6   196    169    49     36    98         78
     Oct.   83   89   -7   -1   -6    0    49     36     1      0     7          0
     Nov.   97   98    7    8    8    9    49     64    64     81    56         72
     Dec.   93   99    3    9    4   10     9     16    81    100    27         40
1932 Jan.   95  102    5   12    6   13    25     36   144    169    60         78
     Feb.   97  102    7   12    8   13    49     64   144    169    84        104
     Mar.  102  105   12   15   13   16   144    169   225    256   180        208
     Apr.   99  105    9   15   10   16    81    100   225    256   135        160
     May    97  107    7   17    8   18    49     64   289    324   119        144
     June   94  107    4   17    5   18    16     25   289    324    68         90
     July   94  101    4   11    5   12    16     25   121    144    44         60
     Aug.   97  106    7   16    8   17    49     64   256    289   112        136
     Sept.  92   96    2    6    3    7     4      9    36     49    12         21
     Oct.   89   90   -1    0    0    1     1      0     0      1     0          0
     Nov.   90   85    0   -5    1   -4     0      1    25     16     0         -4
     Dec.   90   81    0   -9    1   -8     0      1    81     64     0         -8
1933 Jan.   92   84    2   -6    3   -5     4      9    36     25   -12        -15
     Feb.   91   85    1   -5    2   -4     1      4    25     16    -5         -8
     Mar.   90   84    0   -6    1   -5     0      1    36     25     0         -5
     Apr.   86   81   -4   -9   -3   -8    16      9    81     64    36         24
     May    85   76   -5  -14   -4  -13    25     16   196    169    70         52
     June   85   77   -5  -13   -4  -12    25     16   169    144    65         48
     July   85   75   -5  -15   -4  -14    25     16   225    196    75         56
     Aug.   83   79   -7  -11   -6  -10    49     36   121    100    77         60
     Sept.  80   78  -10  -12   -9  -11   100     81   144    121   120         99
     Oct.   78   78  -12  -12  -11  -11   144    121   144    121   144        121
     Nov.   80   76  -10  -14   -9  -13   100     81   196    169   140        117
     Dec.   83   75   -7  -15   -6  -14    49     36   225    196   105         84
1934 Jan.   82   80   -8  -10   -7   -9    64     49   100     81    80         63
     Feb.   83   91   -7    1   -6    2    49     36     1      4    -7        -12
     Mar.   85   87   -5   -3   -4   -2    25     16     9      4    15          8
     Apr.   83   84   -7   -6   -6   -5    49     36    36     25    42         30
     May    82   81   -8   -9   -7   -8    64     49    81     64    72         56
     June   85   83   -5   -7   -4   -6    25     16    49     36    35         24
     July   88   83   -2   -7   -1   -6     4      1    49     36    14          6
     Aug.  101   92   11    2   12    3   121    144     4      9    22         36
     Sept. 102   98   12    8   13    9   144    169    64     81    96        117
     Oct.   98   94    8    4    9    5    64     81    16     25    32         45
     Nov.   96   94    6    4    7    5    36     49    16     25    24         35
     Dec.   98   95    8    5    9    6    64     81    25     36    40         54
1935 Jan.   98  100    8   10    9   11    64     81   100    121    80         99
     Feb.   92   99    2    9    3   10     4      9    81    100    18         30
     Mar.   92   96    2    6    3    7     4      9    36     49    12         21
     Apr.   90   98    0    8    1    9     0      1    64     81     0          9
     May    88   97   -2    7   -1    8     4      1    49     64   -14         -8
     June   86   98   -4    8   -3    9    16      9    64     81   -32        -27
     July   83   99   -7    9   -6   10    49     36    81    100   -63        -60
     Aug.   80   92  -10    2   -9    3   100     81     4      9   -20        -27
     Sept.  81   90   -9    0   -8    1    81     64     0      1     0         -8
     Oct.   86   89   -4   -1   -3    0    16      9     1      0     4          0
     Nov.   83   87   -7   -3   -6   -2    49     36     9      4    21         12
     Dec.   82   83   -8   -7   -7   -6    64     49    49     36    56         42
Total               -118  -14  -58   46 3,100  2,924 4,814  4,846 2,565      2,493

Further, working the regressions in the way best to avoid errors in
rounding off,


$$b_1 = \frac{p}{\sigma_y^2} = 0.527, \qquad b_2 = \frac{p}{\sigma_x^2} = 0.885$$

Thus the correlation coefficient is 0.68, and the regression equations,
referred to the means, are

$$x = 0.527y$$

$$y = 0.885x$$

If we prefer to express these equations with origin at X = 0, Y = 0,
we have

$$X - (90 - 1.97) = X - 88.03 = 0.527(Y - 89.77)$$

$$Y - (90 - 0.23) = Y - 89.77 = 0.885(X - 88.03)$$

which reduce to

$$X = 0.527Y + 40.72 \tag{a}$$

$$Y = 0.885X + 11.86 \tag{b}$$

The lines of regression are drawn on the scatter diagram of fig. 9.4.
The standard errors made in using these equations to estimate the
index-number of oats from animal feeding-stuffs, and vice versa, are

$$\sigma_x\sqrt{1 - r^2} = 5.07, \qquad \sigma_y\sqrt{1 - r^2} = 6.57$$

Equation (a) tells us that a rise of one point in the price index-number of
oats is accompanied on the average by a rise of 0.527 point in the price
index-number of feeding-stuffs. Similarly, equation (b) tells us that a
rise of one point in the index for feeding-stuffs is accompanied on the average
by a rise of 0.885 point in the price of oats.
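
For readers who wish to repeat the arithmetic of this example by machine, the following Python sketch works through the same steps from the 60 pairs of Table 9.7. It is offered as an illustration only; the variable names are assumptions of the sketch, and the working means are taken at 90 as in the text.

```python
from math import sqrt

# Index-numbers of Table 9.7, January 1931 to December 1935.
X = [78, 77, 85, 88, 87, 82, 81, 77, 76, 83, 97, 93,
     95, 97, 102, 99, 97, 94, 94, 97, 92, 89, 90, 90,
     92, 91, 90, 86, 85, 85, 85, 83, 80, 78, 80, 83,
     82, 83, 85, 83, 82, 85, 88, 101, 102, 98, 96, 98,
     98, 92, 92, 90, 88, 86, 83, 80, 81, 86, 83, 82]
Y = [84, 82, 82, 85, 89, 90, 88, 92, 83, 89, 98, 99,
     102, 102, 105, 105, 107, 107, 101, 106, 96, 90, 85, 81,
     84, 85, 84, 81, 76, 77, 75, 79, 78, 78, 76, 75,
     80, 91, 87, 84, 81, 83, 83, 92, 98, 94, 94, 95,
     100, 99, 96, 98, 97, 98, 99, 92, 90, 89, 87, 83]

N = len(X)
A = B = 90                                  # working means
xi = [x - A for x in X]
eta = [y - B for y in Y]

S_xi, S_eta = sum(xi), sum(eta)
S_xi2 = sum(v * v for v in xi)
S_eta2 = sum(v * v for v in eta)
S_xi_eta = sum(a * b for a, b in zip(xi, eta))

xi_bar, eta_bar = S_xi / N, S_eta / N
var_x = S_xi2 / N - xi_bar ** 2
var_y = S_eta2 / N - eta_bar ** 2
p = S_xi_eta / N - xi_bar * eta_bar         # equation (9.12), divided by N

r = p / sqrt(var_x * var_y)
b1 = p / var_y                              # regression of X on Y
b2 = p / var_x                              # regression of Y on X
mean_x, mean_y = A + xi_bar, B + eta_bar

print("r  =", round(r, 2))
print("b1 =", round(b1, 3), " b2 =", round(b2, 3))
print("X = %.3f Y + %.2f" % (b1, mean_x - b1 * mean_y))
print("Y = %.3f X + %.2f" % (b2, mean_y - b2 * mean_x))
```

Because the sketch carries the coefficients unrounded, the constant terms it prints may differ slightly in the second decimal place from those of equations (a) and (b), which were formed from the rounded values 0.527 and 0.885.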

It is important to note that the regression equations do not tell us 
whether a variation in one variate is caused by a variation in the other ; 
all we know is that the two vary together, and so far as the regression 
equations show, either the feeding-stuffs price may exert an influence on 
the oats price, or vice versa, or their common variation may be due to 
some other cause affecting both. This is only one instance of a difficulty 
which pervades the theory of correlation and regression, namely, that 
of interpreting results in terms of causal factors.

Example 9.2, Table 9.9. We now consider an example based on
grouped data. In this we have omitted the auxiliary quantities necessary
for checking in order to save space.

(Unpublished data; measurements by G. U. Yule.) The two variables
are (1) X, the length of a mother-frond of duckweed (Lemna minor);
(2) Y, the length of the daughter-frond. The mother-frond was measured
when the daughter-frond separated from it, and the daughter-frond when
its first daughter-frond separated. Measures were taken from camera
drawings made with the Zeiss-Abbe camera under a low power, the actual
magnification being 24 : 1. The units of length in the tabulated
measurements are millimetres on the drawings.

The arbitrary origin for both X and Y was taken at 105 mm. The
following are the values found for the constants of the single distributions:

$\bar\xi = -1.058$ intervals $= -6.3$ mm.; $M_1 = 98.7$ mm. on drawing $= 4.11$ mm. actual
$\sigma_x = 2.828$ intervals $= 17.0$ mm. on drawing $= 0.707$ mm. actual
$\bar\eta = -0.203$ interval $= -1.2$ mm.; $M_2 = 103.8$ mm. on drawing $= 4.32$ mm. actual
$\sigma_y = 3.084$ intervals $= 18.5$ mm. on drawing $= 0.771$ mm. actual

To calculate $\Sigma(\xi\eta)$ the value of $\xi\eta$ is first written in every
compartment of the table against the corresponding frequency, treating the
class-interval as unit. In Table 9.9 frequencies are shown in ordinary type
and the values of $\xi\eta$ in heavy type. In making these entries the sign
of the product may be neglected, but it must be remembered that this
sign will be positive in the upper left-hand and lower right-hand quadrants,
and negative in the two others. The frequencies are then collected,
according to the magnitude and sign of $\xi\eta$, in columns 2 and 3 of Table
9.10. When columns 2 and 3 are completed they should be checked
to see that no frequency has been dropped, which may readily be done
by adding together the totals of the two columns and the frequency
in the 8th row and 8th column of Table 9.9 (the row and column for
which $\xi\eta = 0$), care being taken not to count twice the frequency in the
compartment common to the two. This grand total must clearly be
equal to N, the total number of observations, which in this case is 266.
The numbers in column 4 are given by deducting the entries in column 3
from those in column 2. The totals so obtained are multiplied by $\xi\eta$
(column 1) and the products entered in column 5 or 6 according to sign.
The algebraic sum of these totals gives

$$\Sigma(\xi\eta) = +1519.5$$

TABLE 9.10

   1          2            3          4         5        6
           Frequencies                          Products
  ξη    + quadrants  - quadrants    Total       +        -
   1         0           8.5        -8.5                8.5
   2        17          13.5        +3.5        7
   3        10.5         9          +1.5        4.5
   4        13.5         6.5        +7         28
   5         2           0.5        +1.5        7.5
   6        13.5         5          +8.5       51
   8        13           1         +12         96
   9         9           4          +5         45
  10         6.5         1          +5.5       55
  12        17.5         0         +17.5      210
  14         1           0          +1         14
  15         6           0          +6         90
  16         7           0          +7        112
  18         2           0          +2         36
  20         8           0          +8        160
  21         2           0          +2         42
  24         6           0          +6        144
  25         1           0          +1         25
  28         1           0          +1         28
  30         3           0          +3         90
  36         1           0          +1         36
  40         1           0          +1         40
  42         2           0          +2         84
  60         1           0          +1         60
  63         1           0          +1         63
Totals     145.5        49                  1,528       8.5

Check: 145.5 + 49 + 71.5 (frequency in the row and column for which ξη = 0) = 266
Hence, dividing by 266,

$$\frac{1}{N}\Sigma(\xi\eta) = 5.712$$

$$p = 5.712 - \bar\xi\bar\eta = 5.712 - 0.215 = 5.497$$

Hence,

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{5.497}{2.828 \times 3.084} = +0.63$$

The regression of daughter-frond on mother-frond is 0.69 (a value
which will not be affected by altering the units of measurement for both
mother- and daughter-fronds, as such an alteration will affect both
standard deviations equally). Hence, the regression equation giving the
average actual length (in millimetres) of daughter-fronds for mother-fronds
of the actual length X is

$$Y = 1.48 + 0.69X$$

We leave it to the student to work out the second regression equation
giving the average length of mother-fronds for daughter-fronds of length Y,
and to check the whole work by a diagram showing the lines of regression
and the means of arrays for the central portion of the table.
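
The closing steps of this example, from the sum of products to the regression equation, may be set out as the short Python sketch below. The figures are those quoted above; the variable names are assumptions of the sketch, and because it carries more decimal places than the text the constant term it prints may differ in the second decimal place from 1.48.

```python
N = 266
sum_xi_eta = 1519.5                   # from Table 9.10, in class-interval units
xi_bar, eta_bar = -1.058, -0.203      # means about the working origin, intervals
sigma_x, sigma_y = 2.828, 3.084       # standard deviations, intervals

p = sum_xi_eta / N - xi_bar * eta_bar
r = p / (sigma_x * sigma_y)
b_yx = p / sigma_x ** 2               # regression of daughter-frond on mother-frond

# Both variates are measured in the same units, so b_yx is a pure number
# and may be used unchanged with the means expressed in actual millimetres.
mean_x_mm, mean_y_mm = 4.11, 4.32
intercept = mean_y_mm - b_yx * mean_x_mm

print("p =", round(p, 3), " r =", round(r, 2))
print("Y = %.2f + %.2f X" % (intercept, b_yx))
```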

Example 9.3, Table 9.2. — ^The following device is frequently useful, 



We have — 


S(*-y)*=E(**)- 2 S(*y)-HE(y«) . . . (i) 

and 

I.{x+y)*=i:(x*)+2L{xy)+-L(y*) . . . (u) 

Hence, knovdng £(*•) and £(y*), we can find S(a:y) if we know either 
E(*— y)* or These quantities are often easier to calculate than 

r{*y) itself. 

Consider the data of Table 9.2. In the usual way, taking a working 
mean centred in the intervals X =25- years, Y =25- years, we have, in 
units of five years — 

|=+0-2924 7 = -0-2353 

2(C») =9,708 =7,090 

o,=l-730 Oy=-481 

Now the value of 17 is constant down diagonals which run from the 
top left hand to the bottom right hand of the table. In fact, for the 
principal diagonal, running from X=15-,y=15- through X—20~, 
y=20-, etc., f— 17 = 0 . For the diagonal above this, running from 
^^=20-, y=15- through X=25-, y=20-, etc., 9=1, and so on. 

Let us then find the diagonal totals. We find:

    Value of ξ − η     Frequency in diagonal
         −3                      4
         −2                     34
         −1                    280
          0                  1,398
          1                  1,051
          2                    263
          3                     73
          4                     31
          5                     12
          6                      5
          7                      2
                             -----
        Total                3,153

The total is the total frequency, which gives a check on the work.

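The value of $\Sigma(\xi-\eta)^2$ for the whole table is then obtained from the
above table by squaring the values in the left-hand column, multiplying
by the corresponding frequency in the right-hand column and adding.
We get

$$\Sigma(\xi-\eta)^2 = (9 \times 4) + (4 \times 34) + (1 \times 280) + \ldots + (49 \times 2) = 4{,}286$$

Hence, from (i),

$$4{,}286 = 9{,}708 + 7{,}090 - 2\Sigma(\xi\eta)$$

$$\Sigma(\xi\eta) = 6{,}256$$

whence, dividing by N = 3,153 and subtracting $\bar{\xi}\bar{\eta}$, we have p = 2.0529 and

$$r = \frac{p}{\sigma_x \sigma_y} = \frac{2.0529}{1.730 \times 1.481} = +0.801$$

The regression equations may now be obtained in the usual manner.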

In the above work we chose equation (i) in preference to equation (ii) 
because the frequencies are seen by inspection to run mainly from the 
top left hand to the bottom right hand of the table. Had they run from 
the top right hand to the bottom left hand we should probably have found 
it better to use equation (ii). 
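
The arithmetic of Example 9.3 can be retraced with a short sketch in Python. The function and variable names below are illustrative only and are not part of the text; the figures are those quoted in the example.

```python
# Diagonal totals of Example 9.3: value of (xi - eta) -> frequency in diagonal.
diagonal_freq = {-3: 4, -2: 34, -1: 280, 0: 1398, 1: 1051, 2: 263,
                 3: 73, 4: 31, 5: 12, 6: 5, 7: 2}


def product_sum_from_differences(sum_x2, sum_y2, diff_freq):
    """Recover sum(x*y) from sum(x^2), sum(y^2) and the distribution of x - y,
    using identity (i): sum((x-y)^2) = sum(x^2) - 2*sum(x*y) + sum(y^2)."""
    sum_diff2 = sum(d * d * f for d, f in diff_freq.items())
    return (sum_x2 + sum_y2 - sum_diff2) / 2, sum_diff2


n = sum(diagonal_freq.values())            # total frequency, a check on the work
sum_xy, sum_diff2 = product_sum_from_differences(9708, 7090, diagonal_freq)

mean_xi, mean_eta = 0.2924, -0.2353        # means about the working origin
p = sum_xy / n - mean_xi * mean_eta        # mean product about the true means
r = p / (1.730 * 1.481)                    # standard deviations in class-intervals

print(n, sum_diff2, sum_xy, round(p, 4), round(r, 3))
```

Had the frequencies run from top right to bottom left, the same function could be adapted to identity (ii) by supplying the distribution of the sums and reversing the sign of the last term.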

9.27 The student should be careful to remember the following points
in working:

(1) To give $\Sigma(\xi\eta)$ and $\bar{\xi}\bar{\eta}$ their correct signs in finding the true mean
deviation product p.

(2) To express $\sigma_x$ and $\sigma_y$ in terms of the class-interval as a unit, in the
value of $r = p/\sigma_x\sigma_y$, for these are the units in terms of which p has been
calculated.

(3) To use the proper units for the standard deviations (not class-intervals
in general) in calculating the coefficients of regression: in forming
the regression equation in terms of the absolute values of the variables,
for example, as above, the work will be wrong unless means and standard
deviations are expressed in the same units.

Fluctuations of sampling

9.28 Further, it must always be remembered that correlation coefficients,
like other statistical measures, are subject to fluctuations of sampling.
We shall consider this point at some length in later chapters (18 and 21),
since the correlation coefficient has certain individual features which
make it of special interest from the sampling point of view. We may,
however, at this stage stress that if the number of observations is small,
no significance can be attached to small, or even moderately large, values
of r as indicating a real correlation in the population from which the
observations are drawn. For example, if N = 36, a value of r = ±0.5 may
be a chance result, though a very infrequent one, in sampling from an
uncorrelated population. If N = 100, r = ±0.3 may similarly be a mere
fluctuation of sampling, though again a very infrequent one. The student
should therefore be careful in interpreting his coefficients.

Corrections for grouping 

9.29 In this connection we may mention the question whether, in calculating
the correlation coefficient from grouped data, any correction is
to be made analogous to the Sheppard correction for grouping which
we have considered in the case of univariate data. In the examples
considered in the foregoing we have not made such corrections.

It appears that, when the distribution is reasonably symmetrical and
obeys conditions similar to those enunciated in 6.12, page 133, we may,
with advantage, correct the standard deviations $\sigma_x$, $\sigma_y$, by applying to
each the formula

$$\sigma^2 \text{ (corrected)} = \sigma^2 - \frac{h^2}{12}$$

where h is the width of the interval. The product term $\Sigma(xy)$ needs no
such correction.

We pointed out in 6.12, however, that sampling fluctuations usually
obliterate any correction for grouping unless the size of the sample is large.
It may, as before, be suggested that unless N is 1,000 or more, it is hardly
worth while making the correction. For example, in Tables 9.1-9.6,
Tables 9.1 and 9.5 have a frequency less than 1,000 and the corrections
are not to be applied; in any case they would not be applied to Tables
9.5 and 9.6, which violate the conditions as to "tapering off."
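
As a rough illustration of the size of the correction, the following Python sketch applies the formula to each standard deviation and leaves the product term alone. The function name is hypothetical, and the figures of Example 9.3 are used purely as arithmetic; nothing is assumed here about whether Table 9.2 satisfies the conditions of 6.12.

```python
import math


def sheppard_corrected_r(p, sigma_x, sigma_y, h_x=1.0, h_y=1.0):
    """Correlation coefficient after correcting each variance by h^2/12.

    p is the mean product about the means; it needs no correction.
    sigma_x, sigma_y, h_x, h_y must all be in the same units
    (here class-intervals, so h = 1).
    """
    var_x = sigma_x ** 2 - h_x ** 2 / 12
    var_y = sigma_y ** 2 - h_y ** 2 / 12
    return p / math.sqrt(var_x * var_y)


# Figures of Example 9.3, in units of the class-interval.
p, sigma_x, sigma_y = 2.0529, 1.730, 1.481
r_raw = p / (sigma_x * sigma_y)
r_corrected = sheppard_corrected_r(p, sigma_x, sigma_y)
print(round(r_raw, 3), round(r_corrected, 3))
```

Since the correction reduces both standard deviations and leaves p untouched, the corrected coefficient is always numerically larger than the uncorrected one.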

9.30 Finally, it should be borne in mind that any coefficient, e.g. the
coefficient of correlation or the coefficient of contingency, gives only a
part of the information afforded by the original data or the correlation
table. The correlation table itself, or the original data if no correlation
table has been compiled, should always be given, unless considerations of
space or of expense absolutely preclude the adoption of such a course.


SUMMARY 


1. A population every member of which bears one of the values of each
of two variates is said to be bivariate. If the members are grouped
according to class-intervals of the two variables, we have a bivariate
frequency-distribution.

2. The bivariate frequency-distribution may be represented by a
frequency-surface or by a stereogram. Ungrouped data (and, less
conveniently, grouped data) can be represented on a scatter diagram.

3. The means of arrays of a bivariate frequency-distribution may be
represented as points by reference to a pair of rectangular axes along
which are measured values of the variables. The means of rows and
those of columns will in general lie respectively about two smooth curves,
called lines of regression. The equations of these curves are called
regression equations.*

4. The regression equations may be regarded as expressions for
estimating from a given value of one variate the average corresponding
value of the other.

5. The coefficient of correlation (product-moment correlation coefficient)
between two variables X and Y is given by

$$r = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\Sigma(y^2)}} = \frac{p}{\sigma_x \sigma_y}$$

where x, y are the values of the variables measured from their respective
means, and $p = \frac{1}{N}\Sigma(xy)$.

6. The correlation coefficient r cannot be less than −1 or greater than
+1. If r = ±1 the variables are perfectly correlated, the points
corresponding to pairs of values x, y all lying on a straight line. If r = −1
the variables are perfectly negatively correlated, low values of one
corresponding to high values of the other. If r = +1 the variables are
perfectly positively correlated, high values of one corresponding to high
values of the other.

7. The linear regression equation of x on y (referred to axes through
their respective means) is

$$x = b_1 y$$

where

$$b_1 = r\frac{\sigma_x}{\sigma_y} = \frac{p}{\sigma_y^2}$$

and that of y on x is

$$y = b_2 x$$

where

$$b_2 = r\frac{\sigma_y}{\sigma_x} = \frac{p}{\sigma_x^2}$$

$b_1$ and $b_2$ being called coefficients of regression, or simply regressions.

8. The straight lines of regression are such that the sums of squares
of errors of estimate, $\Sigma(x - b_1 y)^2$ and $\Sigma(y - b_2 x)^2$, are a minimum. If the
quotients of these sums by N are denoted by $s_x^2$, $s_y^2$,

$$s_x^2 = \sigma_x^2(1 - r^2)$$

$$s_y^2 = \sigma_y^2(1 - r^2)$$

* Curvilinear regression lines, like straight regression lines, may also be defined by an
extension of the principle of making sums of squares of errors of estimate a minimum.

EXERCISES 

9.1 Find the correlation coefficient and the equations of regression for the 
following values of X and Y — 

X Y 

1 2 

2 5 

3 3 

4 8 

5 7 

[As a matter of practice it is never worth calculating a correlation coef- 
ficient for so few observations : the figures are given solely as a short 
example on which the student can test his knowledge of the work.] 

9.2 (Data from W. Little: Labour Commission Report, Vol. 5, Part 1,
1894, and Official Returns.)

The figures in the table on p. 234 show (1) the estimated average earnings
of agricultural labourers, X, (2) the percentage of population in receipt of
poor law relief, Y, (3) the ratio of the number of paupers receiving outdoor
relief to the number receiving relief in workhouses, Z, for certain districts
in England and Wales in 1^. 

Find the correlations between X and Y, Y and Z, and Z and X. Draw
scatter diagrams to illustrate the various joint distributions.

9.3 Verify the data in the table heading p. 235 for the undermentioned
tables of this chapter. Calculate the means of rows and columns and
draw a diagram showing the lines of regression for the data of Table 9.6.
(Sheppard's correction used only in Table 9.4.)

In calculating the coefficient of contingency (coefficient of mean square
contingency) use the following groupings, so as to avoid small scattered
frequencies at the extremities of the tables and also excessive arithmetic:

Table 9.1. Group together (1) two top rows, (2) three bottom rows,
(3) two first columns, (4) four last columns, leaving centre of table as
it stands.
Table for Exercise 9.2

X: estimated average earnings of agricultural labourers (shillings and
pence per week). Y: percentage of population in receipt of Poor Law
relief. Z: ratio of number of paupers receiving outdoor relief to the
number receiving relief in workhouses.

    Union                        X            Y        Z
                              s.    d.
     1. Glendale              20     9      2.40     6.40
     2. Wigton                20     3      2.29     4.04
     3. Garstang              19     8      1.39     7.90
     4. Belper                18     6      1.92     3.31
     5. Nantwich              17     8      2.98     7.85
     6. Atcham                17     6      1.17     0.45
     7. Driffield             17     1      3.79    10.00
     8. Uttoxeter             17     0      3.01     4.43
     9. Wetherby              17     0      2.39     4.78
    10. Easingwold            16    11      2.78     4.73
    11. Southwell             16     6      3.09     6.66
    12. Hollingbourn          16     4      2.78     1.22
    13. Melton Mowbray        16     3      2.61     4.27
    14. Truro                 16     3      4.33     7.50
    15. Godstone              16     0      3.02     4.44
    16. Louth                 16     0      4.20     8.34
    17. Brixworth             15     9      1.29     0.69
    18. Crediton              15     8      5.16     9.89
    19. Holbeach              15     6      4.75     4.00
    20. Maldon                15     6      4.64     6.02
    21. Monmouth              15     4      4.26     8.27
    22. St. Neots             15     3      1.66     1.58
    23. Swaffham              15     0      5.37    16.04
    24. Thakeham              15     0      3.38     1.96
    25. Thame                 15     0      5.84     9.28
    26. Thingoe               15     0      4.63     8.72
    27. Basingstoke           15     0      3.93     2.97
    28. Cirencester           15     0      4.54     5.38
    29. North Witchford       14    10      3.42     3.24
    30. Pewsey                14     9      5.88     7.61
    31. Bromyard              14     9      4.36     5.87
    32. Wantage               14     9      3.85     5.50
    33. Stratford-on-Avon     14     7      3.92     3.58
    34. Dorchester            14     6      4.48     6.93
    35. Woburn                14     6      5.67     6.02
    36. Buntingford           14     4      4.91     4.92
    37. Pershore              13     6      4.34     4.64
    38. Langport              12     6      5.19    10.56

Table 9.3. Regroup by 2-inch intervals, 58.5-60.5, etc., for father,
59.5-61.5, etc., for son. If a 3-inch grouping be used (58.5-61.5, etc.,
for both father and son), the coefficient of mean square contingency is
0.465.

Table 9.4. For columns, group those headed 3 and 4, 5 and 6, 7 and 8,
9 and 10, 11 and over; for rows, group those headed 8-11, 12-13, 14-15,
16-17, 18-19, 20-21, 22-23, 24-25, 26-27, 28 and over.
Table for Exercise 9.3

                                      Table 9.1   Table 9.3   Table 9.4    Table 9.6
    Mean of X                         55.3 mm     67.70 in    6.22 yrs     14.54 per thou.
    Mean of Y                         53.1 mm     68.66 in    18.61 gal    379.47 births
    Standard deviation of X           6.86 mm     2.72 in     2.21 yrs     2.87 per thou.
    Standard deviation of Y           5.77 mm     2.75 in     3.37 gal     505.24 births
    Coefficient of correlation        +0.97       +0.51       +0.22        +0.17
    Coefficient of contingency
    (for the grouping stated below)   0.90        0.51        0.26         0.30


Table 9.6. For columns, take singly those for 0-, 200-, group 400-
and 600-, and group 800- and over. Rows, group those headed 6-11,
12 and 13, 14 and 15, 16-18, 19 and over.

9.4 (Data from Statistical Review of England and Wales for 1933, Tables,
Part 1, p. 3, and Part 2, p. 6.) The following show mean annual birth
and death rates in England and Wales for quinquennia since 1876. Find
the correlation between birth and death rates.

    Period        Mean annual live birth rate    Mean annual death rate
                  per 1,000 of population        per 1,000 of population
    1876-80                 35.3                          20.8
    1881-85                 33.5                          19.4
    1886-90                 31.4                          18.9
    1891-95                 30.5                          18.7
    1896-1900               29.3                          17.7
    1901-1905               28.2                          16.0
    1906-1910               26.3                          14.7
    1911-15                 23.6                          14.3
    1916-20                 20.1                          14.4
    1921-25                 19.9                          12.2
    1926-30                 16.7                          12.1

9.5 The following figures (S. Rowson, Journ. Roy. Stat. Soc., vol. 99, 1936)
give the relationship between the density of population and seating capacity
of cinemas in various districts of Great Britain.

Find the correlation between density of population and proportion of
cinemas with (1) seating capacity 500 or less, (2) seating capacity 2,000
or more.

                                     Density of        Percentage of cinemas
    District                         population        (1) Seating    (2) Seating
                                     per square mile   500 or less    2,000 or more
    Scotland                              163             13.4            4.3
    North Wales                           165             42.5            0.0
    West of England                       380             38.2            2.1
    Eastern Counties                      431             38.8            1.3
    South Wales                           440             22.4            1.2
    North of England                      487             16.0            1.2
    Yorkshire and district                594             15.5            3.1
    Midlands                              710             20.2
    Home Counties (excl. London)          794             28.2
    Lancashire                          2,157             13.5


9.6 Show that the coefficient of correlation is the geometric mean of the
coefficients of regression; verify from the data of Examples 9.1, 9.2 
and 9.3 that the arithmetic mean of the coefficients of regression is 
greater than the coefficient of correlation. 

9.7 The tangent of the difference of angles A and B is given by

$$\tan(A - B) = \frac{\tan A - \tan B}{1 + \tan A \tan B}$$

Deduce that the smaller angle between regression lines is $\theta$, given by

$$\tan\theta = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$

and interpret this result when r = 0 and r = ±1.
The bivariate normal surface

10.1 Our study of the normal curve in Chapter 8 may be extended
to yield a corresponding expression for the frequency-distribution of pairs
of values of two variates. This bivariate normal distribution, known also
as "the bivariate normal surface," "the normal correlation surface" or
simply "the normal surface," occupies a central position in the theory
of bivariate frequency-distributions, and bears to them a relation similar
to that borne by the normal curve to the frequency-distributions of a
single variate.

The normal surface is of great historical importance, as the earlier
work on correlation is, almost without exception, based on the assumption
of such a distribution; though when it was recognised that the properties
of the correlation coefficient could be deduced, as in Chapter 9, without
reference to the form of the distribution of frequency, a knowledge of this
special type of frequency-surface ceased to be so essential. But the
generalised normal law is of importance in the theory of sampling: it
serves to describe very approximately certain actual distributions (e.g. of
measurements on man); and if it can be assumed to hold good, some of the
expressions in the theory of correlation, notably the standard deviations
of arrays (and, if more than two variables are involved, the partial
correlation coefficients), can be assigned more simple and definite meanings than
in the general case. The student should, therefore, be familiar with the
more fundamental properties of the distribution.

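10.2 Consider first the case in which the two variables are completely
independent. Let the distributions of frequency for the two variables
$x_1$ and $x_2$ singly be given by

$$y_1 = y'_1 \exp\left(-\frac{x_1^2}{2\sigma_1^2}\right) \qquad y_2 = y'_2 \exp\left(-\frac{x_2^2}{2\sigma_2^2}\right) \qquad (10.1)$$

Then, assuming independence, the frequency-distributions of pairs of
values must, by the rule of independence, be given by

$$y_{12} = y'_{12} \exp\left\{-\tfrac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)\right\} \qquad (10.2)$$

where

$$y'_{12} = \frac{y'_1 y'_2}{N} = \frac{N}{2\pi\sigma_1\sigma_2} \qquad (10.3)$$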

Equation (10.2) gives a normal correlation surface for one special case, the
correlation coefficient being zero. If we put $x_2$ = a constant, we see that
every section of the surface by a vertical plane parallel to the $x_1$-axis, i.e.
the distribution of any array of $x_1$'s, is a normal distribution, with the same
mean and standard deviation as the total distribution of $x_1$'s; and a similar
statement holds for the arrays of $x_2$'s; these properties must hold good,
of course, as the two variables are assumed independent (cf. 3.18). The
contour lines of the surface, that is to say, lines drawn on the surface at a
constant height, are a series of similar ellipses with major and minor axes
parallel to the axes of $x_1$ and $x_2$ and proportional to $\sigma_1$ and $\sigma_2$, the equations
to the contour lines being of the general form

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = c^2 \qquad (10.4)$$

Pairs of values of $x_1$ and $x_2$ related by an equation of this form are, therefore,
equally frequent.

10.3 Now suppose we have two correlated variates $x_1$ and $x_2$, and let
the regression of $x_1$ on $x_2$ be $b_{12}$ and that of $x_2$ on $x_1$ be $b_{21}$. Let $r_{12}$ be the
coefficient of correlation between $x_1$ and $x_2$.

Consider the new variates defined by the equations

$$x_{1.2} = x_1 - b_{12}x_2$$

$$x_{2.1} = x_2 - b_{21}x_1$$

This is a notation which we shall later extend considerably.

Then $x_1$ and $x_{2.1}$ are uncorrelated, as are $x_2$ and $x_{1.2}$.

For

$$\Sigma(x_1 x_{2.1}) = \Sigma\{x_1(x_2 - b_{21}x_1)\} = \Sigma(x_1 x_2) - b_{21}\Sigma(x_1^2) = 0$$

and similarly for $\Sigma(x_2 x_{1.2})$.

Writing $\sigma_1$, $\sigma_2$ for the standard deviations of $x_1$, $x_2$, we see that the
standard deviation of $x_{1.2}$ is given by

$$\sigma_{1.2}^2 = \frac{1}{N}\Sigma(x_1 - b_{12}x_2)^2$$

$$= \sigma_1^2 - 2b_{12}r_{12}\sigma_1\sigma_2 + b_{12}^2\sigma_2^2$$

$$= \sigma_1^2 - 2r_{12}^2\sigma_1^2 + r_{12}^2\sigma_1^2$$

$$= \sigma_1^2(1 - r_{12}^2)$$

and similarly $\sigma_{2.1}$, the standard deviation of $x_{2.1}$, is given by

$$\sigma_{2.1}^2 = \sigma_2^2(1 - r_{12}^2)$$

We obtained these results in a slightly different form in 9.22 and 9.24.
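
The two results of 10.3 can be checked numerically on any set of paired observations. The sketch below, in Python, uses a small set of hypothetical figures (they are not taken from any table in the text), and the helper names are illustrative only.

```python
import math


def deviations(values):
    """Deviations of the values from their own mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def mean_product(a, b):
    """(1/N) * sum(a*b) for two series of deviations."""
    return sum(u * v for u, v in zip(a, b)) / len(a)


# Hypothetical paired observations.
raw_1 = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]
raw_2 = [2.0, 1.0, 5.0, 4.0, 8.0, 7.0]

x1, x2 = deviations(raw_1), deviations(raw_2)
var_1, var_2 = mean_product(x1, x1), mean_product(x2, x2)
p = mean_product(x1, x2)
r12 = p / math.sqrt(var_1 * var_2)

b12 = p / var_2          # regression of x1 on x2
b21 = p / var_1          # regression of x2 on x1

x1_2 = [u - b12 * v for u, v in zip(x1, x2)]   # x_{1.2} = x1 - b12*x2
x2_1 = [v - b21 * u for u, v in zip(x1, x2)]   # x_{2.1} = x2 - b21*x1

print(mean_product(x1, x2_1), mean_product(x2, x1_2))           # both ~0
print(mean_product(x1_2, x1_2), var_1 * (1 - r12 ** 2))         # equal
print(mean_product(x2_1, x2_1), var_2 * (1 - r12 ** 2))         # equal
```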


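10.4 Suppose further that $x_1$ and $x_{2.1}$ are not only uncorrelated, but
independent, and that each is normally distributed.

In accordance with equation (10.2), we must have for the frequency-distribution
of pairs of deviations of $x_1$ and $x_{2.1}$

$$y_{12} = y'_{12} \exp\left\{-\tfrac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2}\right)\right\} \qquad (10.5)$$

But

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} = \frac{x_1^2}{\sigma_1^2} + \frac{(x_2 - b_{21}x_1)^2}{\sigma_{2.1}^2} = \frac{x_1^2}{\sigma_1^2(1 - r_{12}^2)} + \frac{x_2^2}{\sigma_{2.1}^2} - 2r_{12}\frac{x_1 x_2}{\sigma_{1.2}\sigma_{2.1}}$$

$$= \frac{x_1^2}{\sigma_{1.2}^2} + \frac{x_2^2}{\sigma_{2.1}^2} - 2r_{12}\frac{x_1 x_2}{\sigma_{1.2}\sigma_{2.1}}$$

Evidently we should also have arrived at precisely the same expression
if we had taken the distribution of frequency for $x_2$ and $x_{1.2}$ and reduced
the exponent

$$\frac{x_2^2}{\sigma_2^2} + \frac{x_{1.2}^2}{\sigma_{1.2}^2}$$

We have, therefore, the general expression for the normal correlation
surface for two variables:

$$y_{12} = y'_{12} \exp\left\{-\tfrac{1}{2}\left(\frac{x_1^2}{\sigma_{1.2}^2} + \frac{x_2^2}{\sigma_{2.1}^2} - 2r_{12}\frac{x_1 x_2}{\sigma_{1.2}\sigma_{2.1}}\right)\right\} \qquad (10.6)$$

Further, since $x_1$ and $x_{2.1}$, and $x_2$ and $x_{1.2}$, are independent, we must have:

$$y'_{12} = \frac{N}{2\pi\sigma_1\sigma_{2.1}} = \frac{N}{2\pi\sigma_2\sigma_{1.2}} = \frac{N}{2\pi\sigma_1\sigma_2(1 - r_{12}^2)^{1/2}} \qquad (10.7)$$

Expressing $\sigma_{1.2}$ and $\sigma_{2.1}$ in terms of $\sigma_1$, $\sigma_2$ and $r_{12}$ we have the
alternative form

$$y_{12} = \frac{N}{2\pi\sigma_1\sigma_2\sqrt{1 - r_{12}^2}} \exp\left\{-\frac{1}{2(1 - r_{12}^2)}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} - 2r_{12}\frac{x_1 x_2}{\sigma_1\sigma_2}\right)\right\} \qquad (10.8)$$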
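
That (10.8) is merely (10.6) and (10.7) rewritten can be confirmed by evaluating both forms at a few points. The Python sketch below does so; the function names are illustrative, and the constants N = 1078, σ₁ = 2.72, σ₂ = 2.75, r₁₂ = 0.51 are those tabulated for Table 9.3 in Exercise 9.3.

```python
import math


def normal_surface(x1, x2, n, s1, s2, r):
    """Ordinate of the normal surface in the form of equation (10.8)."""
    k = 1.0 - r * r
    q = (x1 / s1) ** 2 + (x2 / s2) ** 2 - 2 * r * x1 * x2 / (s1 * s2)
    return n / (2 * math.pi * s1 * s2 * math.sqrt(k)) * math.exp(-q / (2 * k))


def normal_surface_array_form(x1, x2, n, s1, s2, r):
    """The same ordinate from (10.6) and (10.7), using s_{1.2} and s_{2.1}."""
    s1_2 = s1 * math.sqrt(1 - r * r)
    s2_1 = s2 * math.sqrt(1 - r * r)
    y_central = n / (2 * math.pi * s1 * s2_1)
    q = (x1 / s1_2) ** 2 + (x2 / s2_1) ** 2 - 2 * r * x1 * x2 / (s1_2 * s2_1)
    return y_central * math.exp(-q / 2)


n, s1, s2, r = 1078, 2.72, 2.75, 0.51
points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 1.5), (4.0, -2.5)]
for a, b in points:
    print(a, b, normal_surface(a, b, n, s1, s2, r),
          normal_surface_array_form(a, b, n, s1, s2, r))
```

The ordinate at (0, 0) is the central ordinate $y'_{12}$ of (10.7).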
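Properties of the normal surface

10.5 For any given value $h_2$ of $x_2$ the distribution of the array of $x_1$'s
is given by

$$y_{12} = y'_{12} \exp\left\{-\tfrac{1}{2}\left(\frac{x_1^2}{\sigma_{1.2}^2} + \frac{h_2^2}{\sigma_{2.1}^2} - 2r_{12}\frac{x_1 h_2}{\sigma_{1.2}\sigma_{2.1}}\right)\right\} = y'_{12}\, e^{-h_2^2/2\sigma_2^2} \exp\left\{-\frac{\left(x_1 - r_{12}\frac{\sigma_1}{\sigma_2}h_2\right)^2}{2\sigma_{1.2}^2}\right\}$$

This is a normal distribution of standard deviation $\sigma_{1.2}$ with a mean
deviating by $r_{12}\frac{\sigma_1}{\sigma_2}h_2$ from the mean of the whole distribution of $x_1$'s.
Hence, since $h_2$ may be any value, we have the important results: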

[Fig. 10.1. Principal axes and contour lines of the normal surface. Labels: axes of measurement; RR, CC, lines of means; contour lines and axes of normal correlation surface; M, mean of whole surface, which is also the summit of the surface.]
(1) that the standard deviations of all arrays of $x_1$ are the same, and
equal to $\sigma_{1.2}$;

(2) that the regression of $x_1$ on $x_2$ is strictly linear.

Similarly, it follows that the s.d.'s of all arrays of $x_2$ are equal to $\sigma_{2.1}$
and that the regression of $x_2$ on $x_1$ is linear.
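
These two properties can be verified numerically by cutting the surface at a fixed value of $x_2$ and computing the mean and standard deviation of the resulting array of $x_1$'s. The following Python sketch does this by a simple summation over a fine grid; the function name, the grid settings and the particular values of $h_2$ are arbitrary choices made for illustration, and the constants are those of Table 9.3.

```python
import math


def array_moments(h2, s1, s2, r, step=0.01, half_width=40.0):
    """Mean and s.d. of the array of x1's at x2 = h2, by numerical summation
    of the exponent of (10.8). Constant factors cancel in the ratios."""
    k = 1 - r * r
    total = first = second = 0.0
    steps = int(round(2 * half_width / step))
    for i in range(steps + 1):
        x1 = -half_width + i * step
        q = (x1 / s1) ** 2 + (h2 / s2) ** 2 - 2 * r * x1 * h2 / (s1 * s2)
        y = math.exp(-q / (2 * k))
        total += y
        first += x1 * y
        second += x1 * x1 * y
    mean = first / total
    sd = math.sqrt(second / total - mean * mean)
    return mean, sd


s1, s2, r = 2.72, 2.75, 0.51
for h2 in (-3.0, 0.0, 2.0, 5.0):
    mean, sd = array_moments(h2, s1, s2, r)
    # The mean should follow the straight line r*(s1/s2)*h2,
    # and the s.d. should stay at s1*sqrt(1 - r^2) for every h2.
    print(h2, round(mean, 4), round(r * s1 / s2 * h2, 4),
          round(sd, 4), round(s1 * math.sqrt(1 - r * r), 4))
```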

10.6 The contour lines are, as in the case of independence, a series
of concentric and similar ellipses; the major and minor axes are, however,
no longer parallel to the axes of $x_1$ and $x_2$, but make a certain angle with
them. Fig. 10.1 illustrates the calculated form of the contour lines for
one case, RR and CC being the lines of regression. As each line of
regression cuts every array of $x_1$ or of $x_2$ in its mean, and as the distribution
of every array is symmetrical about its mean, RR must bisect every
horizontal chord and CC every vertical chord, as illustrated by the two
chords shown by dotted lines; it also follows that RR cuts all the ellipses
in the points of contact of the horizontal tangents to the ellipses, and CC in
the points of contact of the vertical tangents. The surface or solid itself,
somewhat truncated, is shown in fig. 9.1, page 208.

10.7 Since, as we see from fig. 10.1, a normal surface for two correlated
variables may be regarded merely as a certain surface for which r is zero
turned round through some angle, and since for every angle through which
it is turned the distributions of all $x_1$ arrays and $x_2$ arrays are normal, it
follows that every section of a normal surface by a vertical plane is a normal
curve, i.e. the distributions of arrays taken at any angle across the surface
are normal.

10.8 It also follows that, since the total distributions of $x_1$ and $x_2$ must
be normal for every angle through which the surface is turned, the
distributions of totals given by slices or arrays taken at any angle across a
normal surface must be normal distributions. But these would give the
distributions of functions like $ax_1 \pm bx_2$, and consequently (1) the
distribution of any linear function of two normally distributed variables $x_1$
and $x_2$ must also be normal; (2) the correlation between any two linear
functions of two normally distributed variables must be normal correlation.

Result (1) is very important, and may easily be extended to
cover the case of n variables $x_1 \ldots x_n$. Suppose, in fact, we have
n such variables each of which is normally distributed, and a linear
function $ax_1 + bx_2 + \ldots + kx_n$. Since $ax_1 + bx_2$ is normally distributed,
$(ax_1 + bx_2) + cx_3$ is normally distributed, and hence so is
$(ax_1 + bx_2 + cx_3) + dx_4$, and so on. Thus the function
$ax_1 + \ldots + kx_n$ is normally distributed.

Hence, the sum of n normal variates is distributed normally; and in
particular the mean of n normal variates is distributed normally. More
particularly still, the means of samples of n from a normal population are
normally distributed.
10.9 Returning to the normal surface, it is interesting to inquire what
is the angle $\theta$ through which the surface has been turned from the position
for which the correlation was zero. The major and minor axes of the
ellipses are sometimes termed the principal axes. If $\xi_1$, $\xi_2$ be the
co-ordinates referred to the principal axes (the $\xi_1$-axis being the $x_1$-axis in
its new position), we have for the relation between $\xi_1$, $\xi_2$, $x_1$, $x_2$, the angle
$\theta$ being taken as positive for a rotation of the $x_1$-axis which will make it,
if continued through 90°, coincide in direction and sense with the $x_2$-axis,

$$\left.\begin{aligned} \xi_1 &= x_1\cos\theta + x_2\sin\theta \\ \xi_2 &= x_2\cos\theta - x_1\sin\theta \end{aligned}\right\} \qquad (10.9)$$

But, since $\xi_1$, $\xi_2$ are uncorrelated, $\Sigma(\xi_1\xi_2) = 0$. Hence, multiplying together
equations (10.9) and summing,

$$0 = (\sigma_2^2 - \sigma_1^2)\sin 2\theta + 2r_{12}\sigma_1\sigma_2\cos 2\theta$$

$$\tan 2\theta = \frac{2r_{12}\sigma_1\sigma_2}{\sigma_1^2 - \sigma_2^2} \qquad (10.10)$$

It should be noticed that if we define the principal axes of any distribution
for two variables as being a pair of axes at right angles for which the
variables $\xi_1$, $\xi_2$ are uncorrelated, equation (10.10) gives the angle that they
make with the axes of measurement whether the distribution be normal
or not.

10.10 The two standard deviations, say $\Sigma_1$ and $\Sigma_2$, about the principal
axes are of some interest, for evidently from 10.2 the major and minor
axes of the contour ellipses are proportional to these two standard
deviations. They may be most readily determined as follows. Squaring
the two transformation equations (10.9), summing and adding, we have:

$$\Sigma_1^2 + \Sigma_2^2 = \sigma_1^2 + \sigma_2^2 \qquad (10.11)$$

Referring the surface to the axes of measurement, we have for the central
ordinate, by equation (10.7),

$$y'_{12} = \frac{N}{2\pi\sigma_1\sigma_2(1 - r_{12}^2)^{1/2}}$$

Referring it to the principal axes, by equation (10.3),

$$y'_{12} = \frac{N}{2\pi\Sigma_1\Sigma_2}$$

But these two values of the central ordinate must be equal, therefore

$$\Sigma_1\Sigma_2 = \sigma_1\sigma_2(1 - r_{12}^2)^{1/2} \qquad (10.12)$$

(10.11) and (10.12) are a pair of simultaneous equations from which $\Sigma_1$ and
$\Sigma_2$ may be very simply obtained in any arithmetical case. Care must,
however, be taken to give the correct signs to the square root in solving.
$\Sigma_1 + \Sigma_2$ is necessarily positive, and $\Sigma_1 - \Sigma_2$ also if r is positive, the major
axes of the ellipses lying along $\xi_1$; but if r be negative, $\Sigma_1 - \Sigma_2$ is also
negative. It should be noted that, while we have deduced (10.12) from
a simple consideration depending on the normality of the distribution, it
is really of general application (like equation (10.11)), and may be obtained
at somewhat greater length from the equations for transforming
co-ordinates.
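
Equations (10.10) to (10.12) lend themselves to a short numerical sketch in Python. The function name is illustrative, and the constants are those tabulated for Table 9.3 in Exercise 9.3 (σ₁ = 2.72, σ₂ = 2.75, r₁₂ = 0.51). One choice made here that is not in the text: the angle is obtained with `math.atan2`, which always selects the root of (10.10) for which the first principal axis is the major axis, so that for negative r the sketch returns a negative angle rather than a negative value of $\Sigma_1 - \Sigma_2$.

```python
import math


def principal_axes(s1, s2, r):
    """Angle of the principal axes (10.10) and the s.d.'s about them."""
    theta = 0.5 * math.atan2(2 * r * s1 * s2, s1 * s1 - s2 * s2)
    c, s = math.cos(theta), math.sin(theta)
    # Squaring each transformation equation (10.9) and summing:
    var_1 = s1 * s1 * c * c + s2 * s2 * s * s + 2 * r * s1 * s2 * s * c
    var_2 = s1 * s1 * s * s + s2 * s2 * c * c - 2 * r * s1 * s2 * s * c
    return theta, math.sqrt(var_1), math.sqrt(var_2)


s1, s2, r = 2.72, 2.75, 0.51
theta, big_s1, big_s2 = principal_axes(s1, s2, r)

print(round(math.degrees(theta), 2))           # angle in degrees
print(round(big_s1, 2), round(big_s2, 2))      # s.d.'s about principal axes
# Checks of (10.11) and (10.12):
print(big_s1 ** 2 + big_s2 ** 2, s1 ** 2 + s2 ** 2)
print(big_s1 * big_s2, s1 * s2 * math.sqrt(1 - r * r))
```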

10.11 As an example of the application of the foregoing theory to a 
practical case, we proceed to consider the distribution of Table 9.3, 
page 202, showing the correlation between stature of father and son, and 
to test, as far as we can by elementary methods, whether a normal surface 
will fit the data. 

10.12 The first important property of the normal distribution is the
linearity of regression. This was well illustrated for these data in fig. 9.8
(page 218). Subject to some investigation as to the deviations from strict
linearity which may occur as the result of sampling fluctuations, we may
conclude that the regression is appreciably linear. We shall consider a
test of linearity in later chapters (see Chapter 21).

10.13 The second important property is the constancy of the standard 
deviation for all parallel arrays. 

The standard deviations of the ten columns from that headed 62.5-63.5
onwards are:

    2.56    2.60
    2.11    2.26
    2.55    2.26
    2.24    2.45
    2.23    2.33

the mean being 2.36. The standard deviations again only fluctuate
irregularly round their mean value. The mean of the first five is 2.34, of
the second five 2.38, a difference of only 0.04; of the first group, two are
greater and three are less than the mean, and the same is true of the second
group. There does not seem to be any indication of a general tendency
for the standard deviation to increase or decrease as we pass from one end
of the table to the other. We are not yet in a position to test how far the
differences from the average standard deviation might have arisen in
sampling from a record in which the distribution was strictly normal, but,
as a fact, a rough test suggests that they might have done so.

10.14 Next we note that the distributions of all arrays of a normal
surface should themselves be normal. Owing, however, to the small
numbers of observations in any array, the distributions of arrays are very
irregular, and their normality cannot be tested in any very satisfactory
way; we can only say that they do not exhibit any marked or regular
asymmetry. But we can test the allied property of a normal correlation
table, viz. that the totals of arrays must give a normal distribution even
if the arrays be taken diagonally across the surface, and not parallel to
either axis of measurement. From an ordinary correlation table we
cannot find the totals of such diagonal arrays exactly, but the totals of
arrays at an angle of 45° will be given with sufficient accuracy for our
present purpose by the totals of lines of diagonally adjacent compartments.
Referring again to Table 9.3, and forming the totals of such diagonals
(running up from left to right), we find, starting at the top left-hand corner
of the table, the following distribution (in order, reading along each row):

     0.25    2       3.25    6.25    8       9.75   17      34.5
    42      46.25   60.5    67.5    85.75   87.25   78      94.25
    78.75   81.25   66.5    59.25   42.25   30.75   29.25   19
    10.75    7       4.25    3.5     1.75    1       0.25

    Total   1078


The mean of this distribution is at 0.359 of an interval above the centre of
the interval with frequency 78; its standard deviation is 4.757 intervals, or,
remembering that the interval is $1/\sqrt{2}$ of an inch, 3.364 inches. (This
value may be checked directly from the constants for the table given in
Exercise 9.3, page 235, for we have, from the first of the transformation
equations (10.9),

$$\sigma_\xi^2 = \sigma_1^2\cos^2\theta + \sigma_2^2\sin^2\theta + 2r_{12}\sigma_1\sigma_2\sin\theta\cos\theta$$

and inserting $\sigma_1 = 2.72$, $\sigma_2 = 2.75$, $r_{12} = 0.51$, $\sin\theta = \cos\theta = 1/\sqrt{2}$, we find
$\sigma_\xi = 3.361$.) Drawing a diagram and fitting a normal curve, we have
fig. 10.2; the distribution is rather irregular but the fit is fair; certainly
there is no marked asymmetry, and, so far as the graphical test goes, the
distribution may be regarded as appreciably normal. One of the greatest
divergences of the actual distribution from the normal curve occurs in the
almost central interval with frequency 78; the difference between the
observed and calculated frequencies is here 12 units, but nevertheless it
may well have occurred as a fluctuation of sampling. In fact, anticipating
our discussion of the use of the standard error (standard deviation of
simple sampling) in testing the significance of sampling fluctuations
(17.4), we may note that the standard error in this case is $\sqrt{npq}$, where
n is the number of observations and p and q the chances of an individual
falling or not falling within the given interval. p may be taken as 90/1078,
and therefore the standard error is

$$\sqrt{1078 \times \frac{90}{1078} \times \frac{988}{1078}} = 9.1$$

The observed deviation, 12, is not much greater than this and may therefore
have occurred as a sampling fluctuation. We have used here the
exact expression for the standard error, but since p is small we might
have used the approximation $\sqrt{pn} = \sqrt{90} = 9.5$. This last is useful as
giving a test which can be applied on sight.

[Fig. 10.2. Distribution of frequency obtained by addition of Table 9.3 along diagonals
running up from left to right, fitted with a normal curve.]
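
The constants quoted in 10.14 can be recomputed from the diagonal totals with a few lines of Python. The sketch below takes the 31 totals in the order in which they are listed, with the interval of frequency 78 as working origin; the variable names are illustrative only.

```python
import math

# Diagonal totals of Table 9.3, from the top left-hand corner onwards.
diagonal_totals = [
    0.25, 2, 3.25, 6.25, 8, 9.75, 17, 34.5, 42, 46.25, 60.5, 67.5, 85.75,
    87.25, 78, 94.25, 78.75, 81.25, 66.5, 59.25, 42.25, 30.75, 29.25, 19,
    10.75, 7, 4.25, 3.5, 1.75, 1, 0.25,
]
ORIGIN = 14  # zero-based position of the interval with frequency 78

n = sum(diagonal_totals)
mean = sum((i - ORIGIN) * f for i, f in enumerate(diagonal_totals)) / n
second = sum((i - ORIGIN) ** 2 * f for i, f in enumerate(diagonal_totals)) / n
sd_intervals = math.sqrt(second - mean ** 2)
sd_inches = sd_intervals / math.sqrt(2)      # one interval is 1/sqrt(2) inch

# Direct check from the constants of the table, with sin = cos = 1/sqrt(2).
s1, s2, r = 2.72, 2.75, 0.51
sd_check = math.sqrt(0.5 * s1 ** 2 + 0.5 * s2 ** 2 + r * s1 * s2)

# Standard error of a frequency whose expected value is 90 out of n.
p = 90 / n
se_exact = math.sqrt(n * p * (1 - p))
se_approx = math.sqrt(n * p)

print(n, round(mean, 3), round(sd_intervals, 3), round(sd_inches, 3))
print(round(sd_check, 3), round(se_exact, 1), round(se_approx, 1))
```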

10.15 So far, we have seen (1) that the regression is approximately
linear; (2) that, in the arrays which we have tested, the standard
deviations are approximately constant, or at least that their differences
are only small, irregular and fluctuating; (3) that the distribution of
totals for one set of diagonal arrays is approximately normal. These
results suggest, though they cannot completely prove, that the whole
distribution of frequency may be regarded as approximately normal,
within the limits of fluctuations of sampling. We may therefore apply a
more searching test, viz. the form of the contour lines and the closeness
of their fit to the contour ellipse of the normal surface. It may, however,
be seen that no very close fit can be expected. Since the frequencies in
the compartments of the table are small, the standard error of any
frequency is given approximately by its square root (17.15), and this
implies a standard error of about 5 units at the centre of the table, 3 units
for a frequency of 9, or 2 units for a frequency of 4; fluctuations of these
magnitudes are quite possible and might cause wide divergences in the
corresponding contour lines.

10.16 Using the suffix 1 to denote the constants relating to the
distribution of stature for fathers, and 2 the same constants for the sons,

$$N = 1078 \qquad M_1 = 67.70 \qquad M_2 = 68.66$$

$$\sigma_1 = 2.72 \qquad \sigma_2 = 2.75 \qquad r_{12} = 0.51$$

Hence we have from equation (10.7),

$$y'_{12} = 26.7$$

and the complete expression for the fitted normal surface is, by (10.8),

$$y_{12} = 26.7 \exp\left\{-\frac{1}{2(1 - 0.51^2)}\left(\frac{x_1^2}{(2.72)^2} + \frac{x_2^2}{(2.75)^2} - \frac{2(0.51)\,x_1 x_2}{2.72 \times 2.75}\right)\right\}$$

The equation to any contour ellipse will be given by equating the index
of e to a constant, but it is very much easier to draw the ellipses if we refer
them to their principal axes. To do this we must first determine $\theta$, $\Sigma_1$
and $\Sigma_2$. From (10.10),

$$\tan 2\theta = -46.49$$

whence $2\theta$ = 91° 14', $\theta$ = 45° 37', the principal axes standing very nearly
at an angle of 45° with the axes of measurement, owing to the two standard
deviations being very nearly equal. They should be set off on the diagram,
not with a protractor, but by taking $\tan\theta$ from the tables (1.022) and
calculating points on each axis on either side of the mean.

To obtain $\Sigma_1$ and $\Sigma_2$ we have, from (10.11) and (10.12),

$$\Sigma_1^2 + \Sigma_2^2 = 14.961$$

$$2\Sigma_1\Sigma_2 = 12.868$$

Adding and subtracting these equations from each other and taking the
square root,

$$\Sigma_1 + \Sigma_2 = 5.275$$

$$\Sigma_1 - \Sigma_2 = 1.447$$

whence $\Sigma_1 = 3.36$, $\Sigma_2 = 1.91$; owing to the principal axes standing nearly
at 45° the first value is sensibly the same as that found for $\sigma_\xi$ in 10.14.

The equations to the contour ellipses, referred to the principal axes, may
therefore be written in the form

\[ \frac{\xi_1^2}{(3.36)^2} + \frac{\xi_2^2}{(1.91)^2} = c^2 \]

where $\xi_1$ and $\xi_2$ are here written for co-ordinates measured along the
principal axes, the major and minor semi-axes being 3.36 × c and 1.91 × c
respectively. To find c for any assigned value of the frequency y we have

\[ y_{12} = y'_{12}\, e^{-c^2/2}, \qquad c^2 = \frac{2(\log y'_{12} - \log y_{12})}{\log e} \]

Supposing that we desire to draw the three contour ellipses for y = 5,
10 and 20, we find c = 1.83, 1.40 and 0.76, or the following values for the
major and minor axes of the ellipses: semi-major axes, 6.15, 4.70, 2.55;
semi-minor axes, 3.50, 2.67, 1.45. The ellipses drawn with these axes
are shown in fig. 10.3, very much reduced, of course, from the original
drawing, one of the squares shown representing a square inch on the
original. The actual contour lines for the same frequencies are shown
by the irregular polygons superposed on the ellipses, the points on these
polygons having been obtained by simple graphical interpolation between
the frequencies in each row and each column; diagonal interpolation
between the frequencies in a row and the frequencies in a column was not
used. It will be seen that the fit of the two lower contours is, on
the whole, fair, especially considering the high standard errors. In the
case of the central contour, y = 20, the fit looks very poor to the eye, but
if the ellipse be compared carefully with the table, the figures suggest
that here again we have only to deal with the effects of fluctuations of
sampling. For father's stature = 66 in., son's stature = 70 in., there is a
frequency of 18.75, and an increase in this much less than the standard
error would bring the actual contour outside the ellipse. Again, for
father's stature = 68 in., son's stature = 71 in., there is a frequency of 19,
and an increase of a single unit would give a point on the actual contour
below the ellipse. Taking the results as a whole, the fit must be considered
quite as good as we could expect with such small frequencies.

Fig. 10.3. Contour lines for the frequencies 5, 10 and 20 of the distribution
of Table 9.3, and corresponding contour ellipses of the fitted normal surface.
$P_1P_1$, $P_2P_2$, principal axes; M, mean.
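
The arithmetic of 10.16 can be retraced with a short sketch. It assumes
only the figures quoted above (14.961, 12.868 and the central ordinate
26.7); the variable and function names are illustrative and not part of
the text.

```python
import math

# Figures quoted in 10.16 from equations (10.11) and (10.12).
sum_of_squares = 14.961   # Sigma_1^2 + Sigma_2^2
twice_product = 12.868    # 2 * Sigma_1 * Sigma_2
y_max = 26.7              # central ordinate y'_12 of the fitted surface

# Adding and subtracting gives (S1 + S2)^2 and (S1 - S2)^2.
s_plus = math.sqrt(sum_of_squares + twice_product)
s_minus = math.sqrt(sum_of_squares - twice_product)
sigma_major = (s_plus + s_minus) / 2
sigma_minor = (s_plus - s_minus) / 2


def contour_scale(y, y0=y_max):
    """Return c such that y = y0 * exp(-c**2 / 2)."""
    return math.sqrt(2 * math.log(y0 / y))


for y in (5, 10, 20):
    c = contour_scale(y)
    print(f"y={y:2d}  c={c:.2f}  "
          f"semi-major={sigma_major * c:.2f}  semi-minor={sigma_minor * c:.2f}")
```

The semi-axes printed may differ from those in the text by a unit in the
last place, since the text multiplies values already rounded to two decimals.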

Isotropic character of the normal surface

10.17 The normal distribution of frequency for two variables is an
isotropic distribution, to which all the theorems of 3.16 apply. For
if we isolate the four compartments of the correlation table common
to the rows and columns centring round values of the variables $x_1$, $x_2$,
$x_1'$, $x_2'$, we have for the ratio of the cross-products (frequency of $x_1x_2$
multiplied by frequency of $x_1'x_2'$, divided by frequency of $x_1'x_2$ multiplied
by frequency of $x_1x_2'$),

\[ \frac{y_{x_1 x_2}\, y_{x_1' x_2'}}{y_{x_1' x_2}\, y_{x_1 x_2'}} = \exp\left\{ \frac{r_{12}\,(x_1' - x_1)(x_2' - x_2)}{\sigma_1 \sigma_2 (1 - r_{12}^2)} \right\} \]

Assuming that $x_1' - x_1$ has been taken of the same sign as $x_2' - x_2$, the
exponent is of the same sign as $r_{12}$. Hence, the association for this group
of four frequencies is also of the same sign as $r_{12}$, the ratio of the cross-
products being unity, or the association zero, if $r_{12}$ is zero. In a normal
distribution, the association is therefore of the same sign, the sign of
$r_{12}$, for every tetrad of frequencies in the compartments common to
two rows and two columns; that is to say, the distribution is isotropic.
It follows that every grouping of a normal distribution is isotropic whether
the class-intervals are equal or unequal, large or small, and the sign of the
association for a normal distribution grouped down to 2 × 2-fold form
must always be the same whatever the axes of division chosen.
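
As a numerical check on 10.17, the following sketch evaluates the ratio
of cross-products of the normal surface at four points and compares it
with the exponential expression. The points chosen and the value r = 0.51
are illustrative assumptions; the function names are hypothetical.

```python
import math


def normal_surface(x1, x2, s1, s2, r, n=1.0):
    """Bivariate normal frequency at (x1, x2), both measured from the means."""
    q = (x1 / s1) ** 2 - 2 * r * x1 * x2 / (s1 * s2) + (x2 / s2) ** 2
    scale = n / (2 * math.pi * s1 * s2 * math.sqrt(1 - r * r))
    return scale * math.exp(-q / (2 * (1 - r * r)))


def cross_product_ratio(x1, x1p, x2, x2p, s1, s2, r):
    """Frequency(x1,x2)*frequency(x1',x2') over frequency(x1',x2)*frequency(x1,x2')."""
    num = normal_surface(x1, x2, s1, s2, r) * normal_surface(x1p, x2p, s1, s2, r)
    den = normal_surface(x1p, x2, s1, s2, r) * normal_surface(x1, x2p, s1, s2, r)
    return num / den


def predicted_ratio(x1, x1p, x2, x2p, s1, s2, r):
    """Closed form: exp{ r (x1'-x1)(x2'-x2) / (s1 s2 (1 - r^2)) }."""
    return math.exp(r * (x1p - x1) * (x2p - x2) / (s1 * s2 * (1 - r * r)))


args = (-1.0, 2.0, 0.5, 3.0, 2.72, 2.75)   # x1, x1', x2, x2', sigma_1, sigma_2
print(cross_product_ratio(*args, 0.51), predicted_ratio(*args, 0.51))
```

With $x_1' - x_1$ and $x_2' - x_2$ both positive, the ratio exceeds unity for
positive r and falls below unity for negative r, whatever tetrad is taken.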

10.18 These theorems are of importance in the applications of the
theory of normal correlation to the treatment of qualitative characters
which are subjected to a manifold classification. The contingency tables
for such characters are sometimes regarded as groupings of a normal
distribution of frequency, and the coefficient of correlation is determined
on this hypothesis by a special procedure (see below, 11.29, page 268).
Before applying this procedure it is well, therefore, to see whether the
distribution of frequency may be regarded as approximately isotropic,
or reducible to isotropic form by some alteration in the order of rows
and columns (3.16 and 3.17). If only reducible to isotropic form by
some rearrangement, this rearrangement should be effected before grouping
the table to 2 × 2-fold form for the calculation of the correlation coefficient
by the process referred to. If the table is not reducible to isotropic
form by any rearrangement, the process of calculating the coefficient
of correlation on the assumption of normality is to be avoided. Clearly,
even if the table be isotropic it need not be normal, but at least the test
for isotropy affords a rapid and simple means for excluding certain dis-
tributions which are not even remotely normal. Table 3.2, page 50,
might possibly be regarded as a grouping of normally distributed frequency
if rearranged as suggested in 3.15 (it would be worth the investigator's
while to proceed further and compare the actual distribution with a fitted
normal distribution), but Table 3.4 could not be regarded as normal, and
could not be rearranged so as to give a grouping of normally distributed
frequency.

10.19 If the frequencies in a contingency table be not large, and also
if the contingency or correlation be small, the influence of casual irregu-
larities due to fluctuations of sampling may render it difficult to say
whether the distribution may be regarded as essentially isotropic or
not. In such cases some further condensation of the table by grouping
together adjacent rows and columns, or some process of "smoothing"
by averaging the frequencies in adjacent compartments, may be of service.
The correlation table for stature in father and son (Table 9.3), for instance,
is obviously not strictly isotropic as it stands: we have seen, however,
that it appears to be normal, within the limits of fluctuations of sampling,
and it should consequently be isotropic within such limits. We can
apply a rough test by regrouping the table in a much coarser form, say
with four rows and four columns: the table below exhibits such a grouping,
the limits of rows and of columns having been so fixed as to include not
less than 200 observations in each array.


TABLE 10.1. (Condensed from Table 9.3, p. 202)

                          Father's stature (inches)
Son's stature     Under     65.5-67.5   67.5-69.5   69.5 and    Total
(inches)          65.5                              over

Under 66.5        97.5       74.25       34.75       10.5       217
66.5-68.5         76.5      108          85          52         321.5
68.5-70.5         33.25      64.75       95          84.5       277.5
70.5 and over     14.75      32.5        80.75      134         262

Total            222        279.5       295.5       281       1,078

Taking the ratio of the frequency in column 1 to the sum of the frequencies
in columns 1 and 2 for each successive row, and so on for the other pairs of
columns, we find the following series of ratios:


TABLE 10.2. Ratio of frequency in column m to frequency in column m plus
frequency in column (m+1) of Table 10.1

                    Columns
Row       1 and 2    2 and 3    3 and 4

1          0.568      0.681      0.768
2          0.415      0.560      0.620
3          0.339      0.405      0.529
4          0.312      0.287      0.376


These ratios decrease continuously as we pass from the top to the bottom
of the table, and the distribution, as condensed, is therefore isotropic.
The student should form one or two other condensations of the original
table to 3 × 3 or 4 × 4-fold form: he will probably find them either
isotropic or diverging so slightly from isotropy that an alteration of the
frequencies, well within the margin of possible fluctuations of sampling,
will render the distribution isotropic.
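
The rough test of 10.19 is easily mechanised. The sketch below takes the
sixteen frequencies of Table 10.1 (as recovered from its row and column
totals), forms the ratios of Table 10.2 and checks that each series
decreases down the table. The function names are illustrative only.

```python
# Frequencies of Table 10.1: rows are sons' stature groups,
# columns are fathers' stature groups.
table_10_1 = [
    [97.5, 74.25, 34.75, 10.5],
    [76.5, 108.0, 85.0, 52.0],
    [33.25, 64.75, 95.0, 84.5],
    [14.75, 32.5, 80.75, 134.0],
]


def adjacent_column_ratios(table):
    """For each row, frequency in column m over the sum for columns m and m+1."""
    return [
        [row[m] / (row[m] + row[m + 1]) for m in range(len(row) - 1)]
        for row in table
    ]


def ratios_decrease_down_rows(ratios):
    """True if every series of ratios falls steadily from top row to bottom row."""
    columns = list(zip(*ratios))
    return all(col[i] > col[i + 1] for col in columns for i in range(len(col) - 1))


ratios = adjacent_column_ratios(table_10_1)
for row_number, row in enumerate(ratios, start=1):
    print(row_number, [f"{value:.3f}" for value in row])
print("isotropic as condensed:", ratios_decrease_down_rows(ratios))
```

The same two functions may be applied to any other condensation of the
original table that the student cares to form.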

Relationship between contingency and normal correlation

10.20 It was shown by Karl Pearson that if a normal bivariate population
is divided into sections so as to form a contingency table, the coefficient
of mean square contingency, C, tends to the value r in magnitude as the
intervals become finer and finer, though of course it is always positive
in sign. It was, in fact, the relation

\[ r^2 = \frac{\phi^2}{1 + \phi^2} \]

where $\phi^2$ is the mean-square contingency, which led Pearson to identify
C with the square root of the expression on the right.

The values of C and r for the distributions of some of the tables of
Chapter 9 were compared in Exercise 9.3, page 235.

SUMMARY

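1. The equation of the normal surface is

\[ y_{12} = \frac{N}{2\pi\sigma_1\sigma_2\sqrt{1 - r_{12}^2}} \exp\left\{ -\frac{1}{2(1 - r_{12}^2)} \left( \frac{x_1^2}{\sigma_1^2} - \frac{2 r_{12} x_1 x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2} \right) \right\} \]

where $\sigma_1$ is the s.d. of $x_1$, $\sigma_2$ that of $x_2$, and $r_{12}$ the correlation between
$x_1$ and $x_2$.

This may also be written

\[ y_{12} = y'_{12} \exp\left\{ -\frac{1}{2} \left( \frac{x_1^2}{\sigma_{1.2}^2} - \frac{2 r_{12} x_1 x_2}{\sigma_{1.2}\,\sigma_{2.1}} + \frac{x_2^2}{\sigma_{2.1}^2} \right) \right\} \]

where

\[ y'_{12} = \frac{N\sqrt{1 - r_{12}^2}}{2\pi\,\sigma_{1.2}\,\sigma_{2.1}} \]

and $\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2}$, $\sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2}$ are the standard
deviations of the arrays.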

2. For two variates normally correlated the standard deviations of 
parallel arrays are equal and the regressions are linear. 

3. Any section of the normal surface by a vertical plane is a normal 
curve, and a section by a horizontal plane is an ellipse. The ellipses given 
by horizontal sections are similar and similarly situated. 

4. The bivariate normal distribution is isotropic. 

5. A linear function of variates, each of which is normally distributed, 
is also normally distributed. 


EXERCISES 

10.1 Deduce equation (10.12) from the equations for transformation of 
co-ordinates without assuming the normal distribution. 

10.2 Hence show that if the pairs of observed values of $x_1$ and $x_2$ are
represented by points on a plane, and a straight line drawn through the
mean, the sum of the squares of the distances of the points from this line
is a minimum if the line is the major principal axis.

10.3 The coefficient of correlation with reference to the principal axes
being zero, and with reference to other axes in general different from zero,
there must be some pair of axes at right angles for which the correlation is
a maximum, i.e. is numerically greatest without regard to sign. Show that
these axes make an angle of 45° with the principal axes, and that the
maximum value of the correlation is

\[ \frac{\Sigma_1^2 - \Sigma_2^2}{\Sigma_1^2 + \Sigma_2^2} \]

10.4 (Sheppard, Phil. Trans. Soc. A, 1898, 192, 101.) A fourfold
table is formed from a normal correlation table, taking the points of
division between A and $\alpha$, B and $\beta$, at the medians, so that
$(A) = (\alpha) = (B) = (\beta) = N/2$. Show that

\[ r = \cos\left\{ \left( 1 - \frac{2(AB)}{N} \right) \pi \right\} \]

10.5 Show that the points of inflection of the sections of the normal
surface by vertical planes through the mean of the distribution lie on an
ellipse; and show how this ellipse may be used to give the standard
deviations of such sections.

10.6 Hence find the minimum and maximum standard deviations which
can be taken by such sections, and show that any specified value of the
s.d. between the minimum and maximum will be given by two, and only
two, sections.

10.7 Assuming that the heights of fathers and sons are distributed 
in the bivariate normal form with a correlation which is positive but not 
unity and with the same means and variances, show that fathers of more 
than average height tend to have sons whose height, though above average, 
is less than that of their respective fathers. Show also that sons of more 
than average height tend to have fathers whose height is less than that 
of their respective sons. Explain why these two results are not in- 
consistent. 

10.8 Find the conditions that the surface

\[ z = k \exp(ax^2 + 2hxy + by^2) \]

can represent a normal correlation surface whose variates are x and y.
Assuming these conditions satisfied, express $\sigma_1$, $\sigma_2$ and $r_{12}$ in terms of
a, h and b.

10.9 Corresponding to x-values $-n, -(n-1), \ldots, -1, 0, 1, \ldots, (n-1), n$,
the y-values are the cubes of the x-values. Show that the covariance
(9.25) of x and y is given by

2«» 

/‘u— + lower powers of n. 

Hence show that for large n the correlation is approximately $\sqrt{0.84} = 0.916$
and thus is not unity although the variates are functionally related.

10.10 In a bivariate normal population the standard deviation of any
x-array is k times that of the x-variate as a whole. Show that the
correlation is $\sqrt{1 - k^2}$ in magnitude.

Methods of estimating the product-moment correlation coefficient

11.1 The only strict method of calculating the correlation coefficient
is that described in Chapter 10, from the formula

\[ r = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}} \]

Where possible this formula should be employed. It sometimes happens,
however, owing to incomplete data, that we are constrained to use some
method of approximation. Furthermore, the large amount of arithmetical
labour involved in applying the ordinary formula may sometimes be
avoided by approximations which are sufficiently accurate for the purpose
in view. We therefore proceed to give a few methods of this kind. They
are not recommended for general use as they will, as a rule, lead to different
results in different hands.

11.2 (1) The means of rows and columns are plotted on a diagram,
and lines fitted to the points by eye, say by shifting about a stretched black
thread until it seems to run as near as may be to all the points. If $b_1$, $b_2$ be
the slopes of these two lines to the vertical and the horizontal respectively,

\[ r = \sqrt{b_1 b_2} \]

Hence the value of r may be estimated from any such diagram as fig. 9.8
or 9.9, in the absence of the original table. Further, if a correlation table
be not grouped by equal intervals, it may be difficult to calculate the
product sum, but it may still be possible to plot approximately a diagram
of the two lines of regression, and so determine roughly the value of r.
Similarly, if only the means of two rows and two columns, or of one row and
one column in addition to the means of the two variables, are known, it will
still be possible to estimate the slopes of RR and CC, and hence the
correlation coefficient.

(2) The means of one set of arrays only, say the rows, are calculated,
and also the two standard deviations $\sigma_x$ and $\sigma_y$. The means are then
plotted on a diagram, using the standard deviation of each variable as the
unit of measurement, and a line fitted by eye. The slope of this line to the
vertical is r. If the standard deviations be not used as the units of
measurement in plotting, the slope of the line to the vertical is $r\sigma_x/\sigma_y$,
and hence r will be obtained by dividing the slope by the ratio of the
standard deviations.

This method, or some variation of it, is often useful as a makeshift when
the data are too incomplete to permit of the proper calculation of the
correlation, only one line of regression and the ratio of the dispersions of
the two variables being required: the ratio of the quartile deviations, or
other simple measures of dispersion, will serve quite well for rough
purposes in lieu of the ratio of standard deviations. As a special case, we
may note that if the two dispersions are approximately the same, the
slope of RR to the vertical is r.

Plotting the medians of arrays on a diagram with the quartile deviations
as units, and measuring the slope of the line, was the method of
determining the correlation coefficient used by Galton, to whom the
introduction of such a coefficient is due.

(3) If $s_x$ be the standard deviation of errors of estimate like $x - b_1 y$,
we have, from 9.24,

\[ s_x^2 = \sigma_x^2 (1 - r^2) \]

and hence

\[ r = \sqrt{1 - \frac{s_x^2}{\sigma_x^2}} \]

But if the dispersions of arrays do not differ largely, and the regression is
nearly linear, the value of $s_x$ may be estimated from the average of the
standard deviations of a few rows, and r determined (or rather estimated)
accordingly. Thus in Table 9.3 the standard deviations of the ten
columns headed 62.5-63.5, 63.5-64.5, etc., are:

    2.56   2.26   2.11   2.26   2.55
    2.45   2.24   2.33   2.23   2.60

    Mean 2.359

The standard deviation of the stature of all sons is 2.75: hence
approximately

\[ r = \sqrt{1 - \left( \frac{2.359}{2.75} \right)^2} = 0.514 \]

This is the same as the value found by the product-sum method to the
second decimal place. It would be better to take an average by counting
the square of each standard deviation once for each observation in the
column (or "weighting" it with the number of observations in the column),
but in the present case this would only lead to a very slightly different
result, viz. s = 2.362, r = 0.512.
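
The estimate of method (3) may be checked by a few lines of arithmetic.
The sketch assumes the ten column standard deviations and the figure 2.75
quoted above; the weighted value 2.362 is taken from the text rather than
recomputed, since the column frequencies are not repeated here.

```python
import math

# Standard deviations of the ten columns of Table 9.3 quoted in the text.
array_sds = [2.56, 2.26, 2.11, 2.26, 2.55, 2.45, 2.24, 2.33, 2.23, 2.60]
sigma_sons = 2.75   # standard deviation of the stature of all sons


def r_from_array_sd(s, sigma):
    """Estimate r from s^2 = sigma^2 (1 - r^2)."""
    return math.sqrt(1 - (s / sigma) ** 2)


mean_sd = sum(array_sds) / len(array_sds)
r_unweighted = r_from_array_sd(mean_sd, sigma_sons)
r_weighted = r_from_array_sd(2.362, sigma_sons)   # weighted s quoted in the text

print(f"mean s = {mean_sd:.3f}, r = {r_unweighted:.3f}")
print(f"weighted s = 2.362, r = {r_weighted:.3f}")
```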

Non-linear regression

11.3 We referred in Chapter 9 to the fact that the treatment of cases
when the regression is non-linear is somewhat difficult. We may, by
the methods of Chapter 15, and otherwise, fit curves of any order to the
means of arrays, just as we have fitted straight lines to them; but the
handling of these regression curves and their interpretation is far more
complicated.

11.4 It is therefore desirable, wherever possible, to deal with variates
which result in linear regression. Now it sometimes happens that if a
relation between X and Y be suggested, we may, either by theory or by
previous experience, throw that relation into the form

\[ Y = A + B\,\phi(X) \]

where A and B are the only unknown constants to be determined. If
a correlation table be then drawn up between Y and $\phi(X)$ instead of Y
and X, the regression will be approximately linear. Thus in Table 9.5,
page 205, if X be the rate of discount and Y the percentage of reserves
on deposits, a diagram of the curves of regression suggests that the
relation between X and Y is approximately of the form

\[ X(Y - B) = A \]

A and B being constants; that is,

\[ XY = A + BX \]

Or, if we make XY a new variable, say Z,

\[ Z = A + BX \]

Hence, if we draw up a new correlation table between X and Z the
regression will probably be much more closely linear.

If the relation between the variables be of the form

\[ Y = AB^X \]

we have

\[ \log Y = \log A + X \log B \]

and hence the relation between log Y and X is linear. Similarly, if the
relation be of the form

\[ X^n Y = A \]

we have

\[ \log Y = \log A - n \log X \]

and so the relation between log Y and log X is linear. By means of
such artifices for obtaining correlation tables in which the regression is
linear, it may be possible to do a good deal in difficult cases whilst using
elementary methods only.
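
The artifice of 11.4 may be illustrated as follows. The data are purely
hypothetical, generated exactly from $Y = AB^X$ with A = 3 and B = 1.5,
and the helper `least_squares_line` is an assumption of this sketch, not a
procedure given in the text. A straight line fitted to log Y against X
recovers the two constants.

```python
import math


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical observations lying exactly on Y = 3 * 1.5**X.
xs = [0, 1, 2, 3, 4, 5]
ys = [3.0 * 1.5 ** x for x in xs]

# log Y = log A + X log B is linear in X, so fit the line to (X, log Y).
intercept, slope = least_squares_line(xs, [math.log(y) for y in ys])
a_hat = math.exp(intercept)
b_hat = math.exp(slope)
print(a_hat, b_hat)
```

With real observations the points would of course scatter about the line,
and the fitted constants would be estimates only.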

The correlation ratios

11.5 In view of the importance of linearity of regression it is desirable
to have some criterion which will enable a judgment to be formed whether
a regression is, within the limits permitted by sampling fluctuations,
linear in any given case. We now proceed to discuss a coefficient designed
for this purpose.

Consider a bivariate frequency table, and let $s_{px}$ be the standard
deviation of the p-th array of X's. Let $n_p$ be the number of observations
in this array.

Let

\[ \sigma_{ax}^2 = \frac{\Sigma(n_p s_{px}^2)}{N} \qquad (11.1) \]

Then $\sigma_{ax}^2$ is the weighted mean of the variances of arrays, obtained as
suggested in the last sentence of 11.2 (3). Now, let

\[ \sigma_{ax}^2 = \sigma_x^2 (1 - \eta_{xy}^2) \qquad (11.2) \]

or

\[ \eta_{xy}^2 = 1 - \frac{\sigma_{ax}^2}{\sigma_x^2} \qquad (11.3) \]

Then $\eta_{xy}$ is called the correlation ratio of X on Y. Similarly, $\eta_{yx}$,
defined by

\[ \eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2} \]

is called the correlation ratio of Y on X.

11.6 The correlation ratios may be put in another form, which is much
more convenient for purposes of calculation.

In fact, if $M_x$ is the mean of all the X's and $m_{px}$ the mean of an array,
we have, as in equation (6.6),

\[ N\sigma_x^2 = \Sigma\left( n_p \{ s_{px}^2 + (M_x - m_{px})^2 \} \right) \]

or, using $\sigma_{mx}$ to denote the standard deviation of $m_{px}$, obtained by
"weighting" each $m_{px}$ according to $n_p$, the number of observations in
the array in which it occurs,

\[ \sigma_x^2 = \sigma_{ax}^2 + \sigma_{mx}^2 \]

Hence, substituting in (11.3),

\[ \eta_{xy} = \frac{\sigma_{mx}}{\sigma_x} \qquad (11.4) \]

The correlation ratio of X on Y is therefore determined when we have
found the standard deviation of X and the standard deviation of the
means of its arrays.

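11.7 In Chapter 9 we saw that

\[ \sigma_x^2 (1 - r^2) = \frac{1}{N} \Sigma (x - b_1 y)^2 \qquad (11.6) \]

where $x - b_1 y = 0$ is the line of regression of x on y, x and y being the
values of X and Y measured from the mean of the distribution.

Now, for any array for which y is constant ($m_{px}$ being likewise measured
from the mean of the distribution),

\[ \Sigma (x - b_1 y)^2 = \Sigma (x - m_{px} + m_{px} - b_1 y)^2 = n_p s_{px}^2 + n_p (m_{px} - b_1 y)^2 \]

the product term vanishing since $\Sigma(x - m_{px}) = 0$. Hence, summing for all
arrays of y,

\[ N\sigma_x^2 (1 - r^2) = N\sigma_{ax}^2 + \Sigma\{ n_p (m_{px} - b_1 y)^2 \} \]

But

\[ \sigma_x^2 (1 - \eta_{xy}^2) = \sigma_{ax}^2 \]

Hence,

\[ \sigma_x^2 (\eta_{xy}^2 - r^2) = \frac{1}{N} \Sigma\{ n_p (m_{px} - b_1 y)^2 \} \]

From this we see that $\eta_{xy}$ cannot be less than r in absolute value.

If $\eta_{xy} = r$, then

\[ \Sigma\{ n_p (m_{px} - b_1 y)^2 \} = 0 \]

i.e.

\[ m_{px} - b_1 y = 0 \]

for all arrays. This means that the mean $m_{px}$ must be on the line of
regression for all arrays, i.e. that the regression is linear.

11.8 The divergence of $\eta^2$ from $r^2$ therefore measures the departure
of the regression from linearity. It should, however, be noted that
sampling fluctuations may cause $\eta^2 - r^2$ to deviate from zero even when
the regression is truly linear. We give later a method of testing the
significance of observed fluctuations of this kind.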

Calculation of the correlation ratio

11.9 The table of Example 11.1 illustrates the form of the arithmetic
for the calculation of the correlation ratio of son's stature on father's
stature (Table 9.3). In the first column is given the type of the array
(stature of father); in the second, the mean stature of sons for that array;
in the third, the difference of the mean of the array from the mean stature
of all sons. In the fourth column these differences are squared, and in
the sixth they are multiplied by the frequency of the array, two decimal
places only having been retained as sufficient for the present purpose.
The sum-total of the last column divided by the number of observations
(1078) gives $\sigma_{my}^2 = 2.058$, or $\sigma_{my} = 1.43$. As the standard deviation of
the sons' stature is 2.75 in., $\eta_{yx} = 0.52$. Before taking the differences for
the third column of such a table, it is as well to check the means of the
arrays by recalculating from them the mean of the whole distribution,
i.e. multiplying each array-mean by its frequency, summing and dividing
by the number of observations. The form of the arithmetic may be
varied, if desired, by working from zero as origin, instead of taking differ-
ences from the true mean. The square of the mean must then be
subtracted from $\Sigma(n_p m_{py}^2)/N$ to give $\sigma_{my}^2$.
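
The arithmetic described in 11.9 may be sketched as below, using the
array means and frequencies of Example 11.1 and the general mean 68.66
adopted in the text. The names are illustrative; the sum of the last
column may differ slightly from the printed 2,218.42 because the text
rounds each product to two decimals.

```python
import math

# (mean stature of sons in the array, frequency of the array) for
# fathers' statures 59, 60, ..., 75 in., as in Example 11.1.
arrays = [
    (64.67, 3), (65.64, 3.5), (66.34, 8), (65.56, 17), (66.68, 33.5),
    (66.74, 61.5), (67.19, 95.5), (67.61, 142), (67.95, 137.5),
    (69.07, 154), (69.39, 141.5), (69.74, 116), (70.50, 78),
    (70.87, 49), (72.00, 28.5), (71.50, 4), (71.73, 5.5),
]
mean_all_sons = 68.66   # general mean used in the text
sigma_sons = 2.75       # standard deviation of sons' stature

n_total = sum(freq for _, freq in arrays)

# Check recommended in the text: recover the general mean from the array means.
mean_check = sum(mean * freq for mean, freq in arrays) / n_total

# Weighted sum of squared deviations of the array means from the general mean.
weighted_sum = sum(freq * (mean - mean_all_sons) ** 2 for mean, freq in arrays)
variance_of_means = weighted_sum / n_total
eta_yx = math.sqrt(variance_of_means) / sigma_sons

print(f"N = {n_total}, mean check = {mean_check:.2f}")
print(f"variance of array means = {variance_of_means:.3f}, eta = {eta_yx:.2f}")
```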

11.10 If the second correlation ratio for this table be worked out in
the same way, the value will be found to be the same to the second place
of decimals: the two correlation ratios for this table are, therefore, very
nearly identical, and only slightly greater than the correlation coefficient
(0.51). Both regressions, as follows from the last section, are very nearly
linear, a result confirmed by the diagram of the regression lines (fig. 9.8,
page 218). On the other hand, it is evident from fig. 9.10, page 220,
that we should expect the two correlation ratios for Table 9.6 to differ
considerably from each other and from the correlation coefficient.

The student should notice that the correlation ratio only affords a
satisfactory test when the number of observations is sufficiently large for
a grouped correlation table to be formed. In the case of a short series of
observations such as that given in Table 9.7, page 207, the method is
inapplicable.

Example 11.1. Calculation of the correlation ratio
Sons' stature on father's stature

(Data of Table 9.3, page 202)

    1            2           3             4            5           6
 Type of      Mean of     Difference    Square of    Frequency   Frequency x
 array        array       from mean     difference               (difference)^2
 (Father's    (Son's      of all sons
 stature)     stature)    (68.66)

   59         64.67        -3.99        15.9201         3           47.76
   60         65.64        -3.02         9.1204         3.5         31.92
   61         66.34        -2.32         5.3824         8           43.06
   62         65.56        -3.10         9.6100        17          163.37
   63         66.68        -1.98         3.9204        33.5        131.33
   64         66.74        -1.92         3.6864        61.5        226.71
   65         67.19        -1.47         2.1609        95.5        206.37
   66         67.61        -1.05         1.1025       142          156.56
   67         67.95        -0.71         0.5041       137.5         69.31
   68         69.07        +0.41         0.1681       154           25.89
   69         69.39        +0.73         0.5329       141.5         75.41
   70         69.74        +1.08         1.1664       116          135.30
   71         70.50        +1.84         3.3856        78          264.08
   72         70.87        +2.21         4.8841        49          239.32
   73         72.00        +3.34        11.1556        28.5        317.93
   74         71.50        +2.84         8.0656         4           32.26
   75         71.73        +3.07         9.4249         5.5         51.84

 Total                                                1,078       2,218.42

\[ \sigma_{my}^2 = 2218.42 / 1078 = 2.058, \qquad \sigma_{my} = 1.43 \]

\[ \eta_{yx} = 1.43 / 2.75 = 0.52 \]

Rank correlation coefficients

11.11 In calculating the coefficient of correlation from the product-
moment it is necessary that the data should be definitely measured. If
they are not so measured we cannot, in general, determine the coefficient,
though we may sometimes approximate to it by one of the methods of
11.2.

But there may be more serious obstacles than imperfect grouping in
the way of finding the correlation between two variates. In the examples
we have considered up to the present the qualities we have discussed have
been easily measurable, involving such familiar concepts as height, weight,
age and so forth. In certain types of inquiry we may have to deal with
qualities which are not expressible as numbers of units of an objective
kind.

11.12 Consider, for instance, the relation between mathematical and
musical ability in a class of students. "Ability," whether of a general
or a specific kind, is a variate in the sense that it varies from one individual
to another; and it may be a numerical variate if we can decide on some
unequivocal way of measuring it. A very common mode of attempting
to do so is by allotting marks to each student. But such methods are open
to many objections, not the least of which is that different examiners would
give different marks to the same person. A correlation between the marks
obtained for mathematics and music would, therefore, be likely to depend
to some extent on the examiner, and would not reflect accurately the
relationship between the two qualities.

11.13 Difficulties of this type disappear to some extent if we arrange
the students in order of their ability, but do not attempt to assess it
numerically. There will still be some divergence of opinion between
different examiners, perhaps, but it will not as a rule be so serious. We
then allot to each student a number which indicates his position in the
arrangement according to ability, the first being number 1, the second
number 2, and so on. The students are then said to be ranked, and the
number of a particular individual is his rank (cf. 6.33).

11.14 A procedure of this kind is useful in the treatment not only of
data which can be ordered but not exactly measured, but of measurable
data also. For instance, we can easily rank a number of men according
to height without actually measuring them. It is also comparatively easy
to rank a number of shades of a colour, or a number of countries according
to their importance in the export market, where precise numerical measure-
ment would be very troublesome.

In the extreme case we may have situations in which individuals can
be ordered but not measured. Suppose, for example, we have a pack
of cards in which a particular suit, say hearts, is in the correct order
ace, two, . . . king. We then shuffle the pack and examine the order of
the heart cards with the intention of discussing whether the shuffling
process was a good one. The relationship between the orders before and
after shuffling is evidently a possible basis of comparison; but there is
not even a theoretically measurable variate corresponding to "order" in
this case.

11.15 If we have a set of individuals ranked according to two different
qualities it is natural to inquire whether the ranks can be made to give
us some measure of the degree of relation between the two qualities.

Suppose we have n individuals, whose ranks according to quality A are
$X_1, X_2, X_3, \ldots, X_n$, and according to quality B are $Y_1, Y_2, Y_3, \ldots, Y_n$,
where the X's and Y's are merely permutations of the first n natural
numbers. Let $d_k = X_k - Y_k$.

The values of d form a convenient measure of the closeness of the
correspondence between A and B. If all the d's are zero the correspond-
ence is perfect, for an individual whose rank is $X_k$ for A will also be $X_k$ for B.
We cannot, however, take the sum of the d's as a measure of correspondence,
because that sum is zero; for the sum of the differences of the X's and Y's
is the difference of the sums of the X's and the Y's, each of which is the sum
of the first n natural numbers.

A possible measure which suggests itself is the sum of the absolute values
of the d's, i.e. $\Sigma|d|$. This measure and its mean $\frac{1}{n}\Sigma|d|$ have, in fact, been
used, but like the mean deviation (6.18) they have certain analytical
disadvantages.

11.16 A more convenient coefficient is obtained as follows.

The values of X range from 1 to n. Their sum is $\frac{n(n+1)}{2}$ and their
mean is accordingly $\frac{n+1}{2}$. This value is also the mean of the Y's.

Let us denote by $x_k$ the value of $X_k - \frac{n+1}{2}$, i.e. the divergence of $X_k$
from the mean. Similarly for $y_k$, which we define as $Y_k - \frac{n+1}{2}$.

Write

\[ \rho = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}} \qquad (11.8) \]

This is the product-moment coefficient of correlation between X and Y.
We shall call $\rho$ Spearman's rank correlation coefficient. It may be
expressed very simply in terms of n and the d's.

For, as we saw in 6.15, $\Sigma(x^2) = \Sigma(y^2) = \frac{1}{12}(n^3 - n)$.

Hence,

\[ \Sigma(d^2) = \Sigma(X - Y)^2 = \Sigma(x - y)^2 = \Sigma(x^2) + \Sigma(y^2) - 2\Sigma(xy) \]

\[ \Sigma(xy) = \frac{1}{12}(n^3 - n) - \frac{1}{2}\Sigma(d^2) \]

and substituting in (11.8),

\[ \rho = 1 - \frac{6\,\Sigma(d^2)}{n^3 - n} \qquad (11.9) \]

Example 11.2. The rankings of ten students in mathematics and
music are as follows:

Mathematics : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Music : 6, 5, 1, 4, 2, 7, 8, 10, 3, 9

What is the coefficient of rank correlation?

The differences d are (mathematical rank minus musical rank)

-5, -3, +2, 0, +3, -1, -1, -2, +6, +1

These add to zero, as they should.

The squares of d are

25, 9, 4, 0, 9, 1, 1, 4, 36, 1

which add up to 90.

Hence, from (11.9),

\[ \rho = 1 - \frac{6 \times 90}{990} = 0.45 \]

11.17 The rank correlation coefficient varies from +1 to -1. If the
rank correlation is perfect, all the d's are zero. If, on the other hand, the
ranks are such that the first, second, third, . . . in one order correspond to the
n-th, (n-1)th, (n-2)th, . . . in the other, $\rho = -1$. The proof is slightly
different according to whether n is even or odd. If it is odd, say $n = 2m+1$,
the d's are

\[ 2m,\ 2m-2,\ \ldots,\ 2,\ 0,\ -2,\ \ldots,\ -(2m-2),\ -2m \]

and

\[ \Sigma(d^2) = 2\{ (2m)^2 + (2m-2)^2 + \ldots + 2^2 \} = \frac{8m(m+1)(2m+1)}{6} \]

Hence,

\[ \rho = 1 - \frac{8m(m+1)(2m+1)}{(2m+1)\{ (2m+1)^2 - 1 \}} = -1 \]

If n is even, say $n = 2m$,

\[ \Sigma(d^2) = 2\{ (2m-1)^2 + \ldots + 1^2 \} = \frac{2m}{3}(4m^2 - 1) \]

and

\[ \rho = 1 - \frac{4m(4m^2 - 1)}{8m^3 - 2m} = -1 \text{ as before.*} \]
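
Formula (11.9) lends itself to a very short routine. The sketch below
applies it to the rankings of Example 11.2 and to a ranking completely
reversed; the function name is illustrative and the routine assumes
untied ranks, as the text does.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation from formula (11.9); ranks must be untied."""
    n = len(rank_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n ** 3 - n)


mathematics = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
music = [6, 5, 1, 4, 2, 7, 8, 10, 3, 9]

rho = spearman_rho(mathematics, music)
rho_reversed = spearman_rho(mathematics, mathematics[::-1])

print(f"Example 11.2: rho = {rho:.2f}")
print(f"Reversed ranking: rho = {rho_reversed:.0f}")
```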


11.18 A second rank correlation coefficient which has certain advantages 
over Spearman’s may be obtained as follows : Consider again the data 
of Example 11.2, and consider the order of each possible pair in the two 
rankings. If any pair is in the same order in both we allot it the,.ecore 
•fl, if in the opposite order the score —1. For instance, of the pairs 
65, 61, 64, 62, 67 the first four are in the reverse order in the second 
ranking as compared with the first and each scores —1 ; the fifth, 67, is 
in the same order and hence scores +1 ; and so on. There are **Cs=45 
possible pairs. The maximum score, if both rankings are the same, is 
45. The minimum score, if one is the inverse of the other, is —45. In 
our present example the total score will be found to be 15. We then 
define a rank correlation coefficient r as 


r= 


Score 


Maximum possible score 


=!|-0-33 


^The property of varying between 1 and — 1 does not belong to a similar coefficient 

Itropoaed by Speannan. and known an his '* foot-rute," vi*. 

It may be ahoa-n in the above manner that R varies from —0*5 to 4- 1, and for this 
mson alone R seems an undesirable coefficients 
11.19 Generally, if $S$ is the score in a ranking of $n$ we have

$$\tau = \frac{S}{\tfrac{1}{2}n(n-1)} \qquad (11.10)$$

$\tau$ may also be regarded, in a sense, as a product-moment correlation. Suppose that for any two ranks $i$, $j$, we allot the value $+1$ if $i > j$ and $-1$ if $i < j$. Call this value $a_{ij}$, so that

$$a_{ij} = \begin{cases} +1, & i > j \\ -1, & i < j \end{cases}$$

Similarly let $b_{ij}$ represent a corresponding quantity in the second ranking. We then have

$$\tau = \frac{\Sigma(a_{ij}b_{ij})}{\sqrt{\Sigma(a_{ij}^2)\,\Sigma(b_{ij}^2)}} \qquad (11.11)$$

for $\Sigma(a_{ij}^2)$ is merely the number of possible pairs $\tfrac{1}{2}n(n-1)$ and the numerator is the score $S$ as defined above.
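The scoring rule of 11.18 and formula (11.10) translate directly into a short routine. This is a sketch only, assuming untied rankings supplied as equal-length lists; the names `kendall_score` and `kendall_tau` are illustrative.

```python
from itertools import combinations


def kendall_score(rank_x, rank_y):
    """Score S: +1 for each pair in the same order in both rankings, -1 otherwise."""
    score = 0
    for i, j in combinations(range(len(rank_x)), 2):
        a = (rank_x[i] > rank_x[j]) - (rank_x[i] < rank_x[j])  # +1 or -1
        b = (rank_y[i] > rank_y[j]) - (rank_y[i] < rank_y[j])  # +1 or -1
        score += a * b
    return score


def kendall_tau(rank_x, rank_y):
    """tau = S / (n(n-1)/2), as in (11.10)."""
    n = len(rank_x)
    return kendall_score(rank_x, rank_y) / (n * (n - 1) / 2)


print(kendall_tau([1, 2, 3], [1, 3, 2]))
```

The products `a * b` are the terms $a_{ij}b_{ij}$ of the product-moment form, so the same loop serves both definitions.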

Example 11.3. A set of 15 recruits are given a preliminary test to admit them to a course of training and, after the completion of training, a proficiency test. Their ranks are:

Candidate        A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
Rank (prelim.)   7  4  1  3 14 13 10 12  5  9  8  2 11 15  6
Rank (profic.)   4  6  3  7 15 11 14 12  1 13  5  2  9 10  8

Does this suggest that the preliminary test was a good predictor of the results in the proficiency test?

To calculate $\tau$ it is convenient to rearrange one ranking so as to be in the natural order $1, \ldots, n$. If we do so for the ranking in the preliminary test we have, for the ranking in the proficiency test:

3 2 7 6 1 8 4 5 13 14 9 12 11 15 10 ... (a)

The score obtained by considering the first member, 3, in conjunction with the others is $12 - 2 = 10$, for there must be 12 members greater than 3 and 2 less than it. Similarly the score (apart from that involving the 3, which has already been counted) involving the 2 is found to be 11. That involving the 7 is 4. The total score (the reader should check this result) is then

$$10+11+4+5+10+5+8+7-2-3+4-1+0-1 = 57$$
Thus, since the maximum possible score is 105 we have

$$\tau = \frac{57}{105} = +0.54$$

indicating a moderate, but not a very high, correlation between the rankings in the two tests.

When one ranking is in the natural order a slightly simpler method of calculating $\tau$ may be used. In the ranking (a) we count the number of members greater than 3 lying to the right of 3 (giving 12), then the number greater than 2 lying to the right of 2 (again 12) and so on. If $R$ is the total score so obtained

$$\tau = \frac{2R}{\tfrac{1}{2}n(n-1)} - 1 \qquad (11.12)$$

a relation which the reader can easily prove for himself.
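The arithmetic of Example 11.3 can be reproduced as follows. This is an illustrative sketch using the ranks tabulated in the example; the variable names are assumptions, and the count `R` follows the "members greater lying to the right" rule described above.

```python
prelim = [7, 4, 1, 3, 14, 13, 10, 12, 5, 9, 8, 2, 11, 15, 6]
profic = [4, 6, 3, 7, 15, 11, 14, 12, 1, 13, 5, 2, 9, 10, 8]

# Put the preliminary ranking into natural order; this yields ranking (a).
seq = [p for _, p in sorted(zip(prelim, profic))]
n = len(seq)
pairs = n * (n - 1) // 2  # maximum possible score

# S: +1 for each later member that is greater, -1 for each that is smaller.
S = sum((seq[j] > seq[i]) - (seq[j] < seq[i])
        for i in range(n) for j in range(i + 1, n))

# R: positive scores only (members greater lying to the right).
R = sum(seq[j] > seq[i] for i in range(n) for j in range(i + 1, n))

tau_from_S = S / pairs
tau_from_R = 2 * R / pairs - 1
print(seq, S, R, round(tau_from_S, 2))
```

Both routes give the same coefficient because every pair scores either $+1$ or $-1$, so that $S = R - (\tfrac{1}{2}n(n-1) - R)$.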


11.20 It is useful to remember that for large $n$ the following relation usually holds approximately except for values of $\rho$ or $\tau$ near to unity:

$$\rho \approx \frac{3\tau}{2} \qquad (11.13)$$

For instance, in the data of Example 11.2 we found $\rho = 0.45$ and $\tau = 0.33$.

11.21 It is rather more troublesome to calculate $\tau$ than to calculate $\rho$, but $\tau$ has advantages for more advanced work.

(a) Where sampling effects are in question the significance of $\tau$ may be tested by known methods but little is known about $\rho$ except in one special case (cf. 19.31-19.34).

(b) $\tau$ may be extended to partial rank correlations.

(c) If an extra member is added to the ranking (as, for instance, if one has been accidentally omitted or further information arrives late) it is easier to recalculate $\tau$ than $\rho$. In fact, in making a new determination of $\rho$, it may be necessary to re-rank many of the members and hence to recalculate the values of $d$; whereas for $\tau$ we need only consider the additional scores attaching to the new member added.

Tied ranks 

11.22 In some classes of ranking work, as for instance in arranging students in order of merit, it is impossible to distinguish between a number of adjacent individuals. In such a case it is customary to average the ranks and to assign the same rank to each even though it may be fractional. For example, in a ranking of 10, we may be able to assign one individual to the rank 1, but be unable to decide which of the next two members shall be second and which third. They are therefore "tied" and each is given the rank $\tfrac{1}{2}(2+3) = 2\tfrac{1}{2}$. The next member is then ranked 4, and so on. If we had to tie the next three members we should allot to each the rank $\tfrac{1}{3}(4+5+6) = 5$. The general procedure will now be clear.
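The averaging procedure can be sketched in code. The function below is a hypothetical helper (not from the text) that ranks raw values from the smallest upward and gives each group of equal values the mean of the ranks the group occupies.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # mean of ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


# One clear first, two tied for second and third, then a clear fourth.
print(average_ranks([10, 20, 20, 30]))
```

A tie of three members occupying ranks 4, 5 and 6 receives the rank 5 for each, as in the text.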

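11.23 When ranks are tied we have a choice in the calculation of $\rho$ and $\tau$. Let us in the first place determine the effect on the sum of squares of the ranks of tying $t$ individuals occupying the ranks $h+1, h+2, \ldots, h+t$. The sum of squares of the untied ranks is

$$(h+1)^2 + (h+2)^2 + \ldots + (h+t)^2 = th^2 + ht(t+1) + \tfrac{1}{6}t(t+1)(2t+1)$$

The sum of squares of the tied ranks is

$$t\{h + \tfrac{1}{2}(t+1)\}^2 = th^2 + ht(t+1) + \tfrac{1}{4}t(t+1)^2$$

The difference is then

$$\tfrac{1}{6}t(t+1)(2t+1) - \tfrac{1}{4}t(t+1)^2 = \tfrac{1}{12}(t^3 - t)$$

Consequently, if we tie $t$ ranks the sum of squares is lowered by $\tfrac{1}{12}(t^3 - t)$. The mean value of the ranks is the same, $\tfrac{1}{2}(n+1)$, and hence the variance of the tied ranking is lowered by $\tfrac{1}{12n}(t^3 - t)$. Moreover, the effect of tying different sets is evidently additive, so that if we have a ranking with ties of $t_1, t_2, \ldots, t_i$ and

$$T_X = \tfrac{1}{12}\Sigma(t^3 - t)$$

the variance of the ranking is

$$\frac{1}{n}\Sigma(x^2) = \tfrac{1}{12}(n^2 - 1) - \frac{1}{n}T_X \qquad (11.14)$$

Similarly it will be found that

$$\frac{1}{n}\Sigma(xy) = \tfrac{1}{12}(n^2 - 1) - \frac{1}{2n}(T_X + T_Y) - \frac{1}{2n}\Sigma(d^2) \qquad (11.15)$$

where $T_Y$ is the quantity corresponding to $T_X$ for the second ranking.

Hence, if we continue to regard $\rho$ as the product-moment correlation of the rankings we have

$$\rho = \frac{\tfrac{1}{6}(n^3 - n) - \Sigma(d^2) - T_X - T_Y}{\{\tfrac{1}{6}(n^3 - n) - 2T_X\}^{1/2}\{\tfrac{1}{6}(n^3 - n) - 2T_Y\}^{1/2}} \qquad (11.16)$$

as compared with the simple formula (11.9) to which it reduces if $T_X = T_Y = 0$.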
11.24 The reader will sometimes find other formulae in use. For instance, (11.9) is sometimes used as it stands for tied ranks. This is certainly wrong. An alternative is to correct $\Sigma(xy)$ for ties as in (11.15) but not to correct the variances, which leads to the formula

$$\rho = 1 - \frac{6\{\Sigma(d^2) + T_X + T_Y\}}{n^3 - n} \qquad (11.17)$$

to which (11.16) reduces if we put $T_X = T_Y = 0$ in the denominator only.

11.25 From some points of view (11.17) may be justifiable. Suppose 
we have two judges who rank a number of candidates identically, though 
there are ties present. In such a case (11.16) is the form to use, for we are
measuring the agreement between them and the correlation should be 
unity. Both judges may be wrong, but that is not the point. We are 
measuring their agreement, not their accuracy. 

But if we have one observer ranking a number of objects which really 
have an objective order (11.17) may be preferable. The observer may tie 
certain ranks because of an inability to distinguish between the individuals 
concerned. In using (11.17) we take this into account in ascertaining the 
covariance of (11.15) ; but in declining to make allowance in the variance
we are refusing, so to speak, to give him credit for clustering his values 
because he ought not to do so, there being a really objective order. The 
effect of using (11.17) instead of (11.16), of course, is to give a lower value 
to p, which appears to conform to the common-sense requirements of the 
position wherein we are measuring the observer's ability to rank individuals 
in their real order. 

11.26 In the calculation of $\tau$ we allot to any tied pair the score 0, this being the intermediate point between the scores of $+1$ or $-1$ which would result if one were greater than the other. The effect of this is to lower the maximum possible score for $X$ by

$$U_X = \tfrac{1}{2}\Sigma\{t(t-1)\} \qquad (11.18)$$

the summation taking place over the ties as for $T_X$. Corresponding to (11.16) we shall then have

$$\tau = \frac{S}{\{\tfrac{1}{2}n(n-1) - U_X\}^{1/2}\{\tfrac{1}{2}n(n-1) - U_Y\}^{1/2}} \qquad (11.19)$$

and corresponding to (11.17)

$$\tau = \frac{S}{\tfrac{1}{2}n(n-1)} \qquad (11.20)$$

In both these formulae the score $S$ is, of course, affected by ties.
Example 11.4. Two foremen rank ten employees according to suitability for promotion as follows:

Employee     A    B    C    D    E    F    G    H    I    J
Foreman 1    1½   1½   3    4    6    6    6    8    9½   9½
Foreman 2    1    2    4    4    4    6    7    8    9    10

In the first ranking there are three sets of ties and we have

$$T_X = \tfrac{1}{12}\{(2^3 - 2) + (3^3 - 3) + (2^3 - 2)\} = 3$$

Similarly

$$T_Y = \tfrac{1}{12}(3^3 - 3) = 2$$

The differences $d$ are

$$\tfrac{1}{2},\ -\tfrac{1}{2},\ -1,\ 0,\ 2,\ 0,\ -1,\ 0,\ \tfrac{1}{2},\ -\tfrac{1}{2}$$

and hence $\Sigma(d^2) = 7$. Hence from (11.16)

$$\rho = \frac{165 - 3 - 2 - 7}{\sqrt{159 \times 161}} = 0.956$$

The scores $S$ contributing to $\tau$, taking the first employee A with the others, then B with C ... J and so on, will be found to be

$$8+8+5+5+3+3+3+2+0 = 37$$

We also have

$$U_X = \tfrac{1}{2}\{2 + 3 \cdot 2 + 2\} = 5$$

$$U_Y = 3$$

Hence, from (11.19)

$$\tau = \frac{37}{\sqrt{40 \times 42}} = 0.903$$

Either coefficient indicates a high degree of agreement between the judges.
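The tie corrections of Example 11.4 can be traced step by step in code. This is a sketch under stated assumptions: Foreman 1's ranks are taken as 1½, 1½, 3, 4, 6, 6, 6, 8, 9½, 9½ (a reading consistent with the three tie sets and the differences $d$ quoted in the example), and the helper names `tie_sizes`, `T` and `U` are illustrative.

```python
from itertools import combinations
from math import sqrt

x = [1.5, 1.5, 3, 4, 6, 6, 6, 8, 9.5, 9.5]  # Foreman 1 (assumed reading)
y = [1, 2, 4, 4, 4, 6, 7, 8, 9, 10]          # Foreman 2


def tie_sizes(ranks):
    """Sizes t of each set of tied ranks."""
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    return [c for c in counts.values() if c > 1]


def T(ranks):
    """Correction (1/12) * sum(t^3 - t) used for rho."""
    return sum(t ** 3 - t for t in tie_sizes(ranks)) / 12


def U(ranks):
    """Correction (1/2) * sum(t(t-1)) used for tau, as in (11.18)."""
    return sum(t * (t - 1) for t in tie_sizes(ranks)) / 2


n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
base = (n ** 3 - n) / 6

# Formula (11.16): rho corrected for ties in both rankings.
rho = (base - sum_d2 - T(x) - T(y)) / sqrt((base - 2 * T(x)) * (base - 2 * T(y)))


def sign(u):
    return (u > 0) - (u < 0)


# Tied pairs contribute 0 to the score because sign(0) == 0.
S = sum(sign(x[j] - x[i]) * sign(y[j] - y[i])
        for i, j in combinations(range(n), 2))
pairs = n * (n - 1) / 2

# Formula (11.19): tau corrected for ties in both rankings.
tau = S / sqrt((pairs - U(x)) * (pairs - U(y)))
print(round(rho, 3), S, round(tau, 3))
```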





Relationship between rank correlation and product-moment correlation

11.27 The rank correlation coefficients as we have introduced them are merely measures, like the coefficients of association, contingency and product-moment correlation, of the correspondence between two quantities. Like those coefficients, they are affected by sampling fluctuations.

They are, however, more easily calculated than most coefficients, and for this reason some writers have advocated their use as a substitute for the product-moment coefficient between the actual measurements, and for estimating the product-moment coefficient from a normal population. We proceed to examine this practice briefly.

Grade correlation

11.28 We referred at the end of Chapter 6 to such quantities as quartiles, deciles and percentiles, which are values of the variate dividing the total frequency into certain specified proportions. For instance, the seventh decile is the variate value such that seven-tenths of the distribution lie below it, i.e. exhibit values of the variate less than the decile.

Generally, we may regard the grade of an individual as the proportion of individuals which lie below him (cf. 6.31). If the population is continuous, the range of grades will also be continuous.


11.29 To each individual in a bivariate population there will be attached two grade numbers, one for each variate, and if the population is correlated the grades will also be correlated. In fact, it has been shown that if the population is normal, $\rho_g$, the grade correlation, and $r$, the ordinary correlation (both calculated by the product-moment method), are related by the equation

$$r = 2\sin\left(\frac{\pi\rho_g}{6}\right) \qquad (11.21)$$

11.30 Ranks and grades are connected by a simple relation. In fact, if an individual is of rank $k$, there are $k-1$ individuals below him (assuming that the ranking proceeds from the lowest variate value). If we admit, conventionally, that one-half of the individual is to be regarded as lying to the left of the line of division which he makes, and one-half to the right, his grade, $g_k$, is given by

$$g_k = (k-1) + \tfrac{1}{2} = k - \tfrac{1}{2} \qquad (11.22)$$

It follows that the correlation between ranks is the same as the correlation between grades. But in a population which is finite and discontinuous (and ranking is in practice applied to comparatively small populations of twenty or thirty individuals) it does not follow that

$$r = 2\sin\left(\frac{\pi\rho}{6}\right) \qquad (11.23)$$
Equation (11.21) was obtained by considering grades in a continuous population, and equation (11.23) is at best an approximation, depending on assumptions which are often of doubtful legitimacy. This is a fact which has not always been appreciated. We may, perhaps, clarify the point by considering the data of Example 11.2.

Example 11.5. In Example 11.2 we found

$$\rho = +0.45$$

If we apply (11.23) we find

$$r = 2\sin 13.5^\circ = +0.47$$

Let us consider what this means. 

The value r purports to be a correlation coefficient such as would have 
been obtained by the product-moment method if the two variates had been 
measurable in the ordinary way. Let us, for the sake of argument, agree 
that mathematical and musical abilities are capable of measurement. 

Now there are only ten members in this population, and it cannot be regarded with any degree of accuracy as a continuous normal population. The use of (11.23) in finding the correlation in the population of ten is therefore of doubtful validity, to say the least.

But it is possible to look at this from rather a different point of view, 
and to regard the ten students as a sample from a practically infinite 
population which is continuous and normal. The value r is then taken to 
be an estimate of the correlation coefficient in this population. 

The legitimacy of this procedure will depend on the extent to which the grade correlation in the sample can be taken to represent the grade correlation in the population. It will, we think, be sufficiently evident from the smallness of the sample that the two are likely to diverge considerably owing to sampling fluctuations.

Furthermore, in the comparatively small samples to which (11.23) is applied (the labour of calculating the rank correlation coefficient for large samples is very tedious) it is difficult to obtain any satisfactory evidence from the data themselves that the population can properly be regarded as normal; and even if the distribution of each of the variates, taken singly, can be rendered normal by some appropriate transformation of the variate which squeezes or stretches the scale of measurement, it does not necessarily follow that the correlation distribution can in this way be rendered normal.

As a matter of interest we may record that, corresponding to (11.23) for $\rho$ we have also the relation

$$r = \sin\left(\frac{\pi\tau}{2}\right) \qquad (11.24)$$

The use of this equation is, of course, subject to the same objections as lie against (11.23).

Use of (11.23) and (11.24) should therefore be made with the utmost 
reserve. It would probably be better to avoid them altogether and rely 
on the rank correlation coefficient. 
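For readers who wish to reproduce the arithmetic of Example 11.5, the two conversions can be written as one-line functions. This is a sketch only; the function names are illustrative, the relations assumed are $r = 2\sin(\pi\rho/6)$ and $r = \sin(\pi\tau/2)$, and the reservations expressed above about their use in small samples apply equally to the code.

```python
from math import pi, sin


def r_from_rho(rho):
    """Product-moment r implied by a grade (or rank) correlation rho
    under the normal-population relation r = 2 sin(pi * rho / 6)."""
    return 2 * sin(pi * rho / 6)


def r_from_tau(tau):
    """Corresponding relation for tau: r = sin(pi * tau / 2)."""
    return sin(pi * tau / 2)


print(round(r_from_rho(0.45), 2))
```

With $\rho = 0.45$ the argument $\pi\rho/6$ is $13.5^\circ$, as in the example.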

11.31 The relationship between the product-moment coefficient and the rank correlation coefficients might profitably be subjected to further investigation, particularly for small numbers of individuals. As we have just seen, with the present state of our knowledge, the use of the rank coefficient is not to be recommended as a brief method of estimating the product-moment coefficient. It is, however, of service as a quick method of gauging relations between variates which are not normally distributed and in any case it is useful where the variates can be ranked but not measured for either practical or theoretical reasons.*

Tetrachoric r 

11.32 To complete our account of methods which have been devised 
as alternatives to the use of the product-moment correlation coefficient in 
cases where, for some reason, that coefficient cannot be computed, we may 
refer to a process specially adapted to the 2 × 2 contingency table.

Consider such a table in the schematic form:

             A        Not-A     Total
B            a        b         a+b
Not-B        c        d         c+d
Total        a+c      b+d       N

Let us assume that our attributes A and B are, in theory, based on measurable quantities; and let us suppose further that the population would be normally distributed with respect to those quantities as variates. Then we may regard the above table as the result obtained by dividing a bivariate normal population into four sections, a division of the X-variate at some point, say $h$, and a division of the Y-variate at some point $k$. If we picture the population as a solid figure, as in fig. 9.1, page 208, the frequencies $a$, $b$, $c$ and $d$ will be the volumes into which the population is divided by planes perpendicular to the X and Y axes through the points $X = h$ and $Y = k$, respectively.

The problem then arises, given $a$, $b$, $c$ and $d$, what are the values of $h$ and $k$ (in terms of the standard deviations of X and Y), and what is the value of $r$?


* For some further developments of this subject see Kendall's Rank Correlation Methods, Second edition, 1955.
11.33 A discussion of this problem, which involves some difficult mathematics, is outside the scope of this book. The student may be referred
to Kendall’s Advanced Statistics, vol. 1, for an account of the method and 
to Tables for Statisticians and Biometricians, Parts I and II, for tables 
which are almost indispensable in working out r for any given case. 

A value of r obtained in this way is said to be tetrachoric. 

The coefficient has often been used to obtain a value of the correlation 
(so-called) for a contingency table, using some reduction to the four-fold 
form by amalgamating adjacent arrays, or possibly making more than one 
such reduction and averaging the results. As such tables are very often 
far from normal, it is always desirable to test the normality by using more 
than one reduction. In any case the reader should be informed precisely 
as to the reduction used. 

The product-moment correlation coefficient for a 2x2 table 

11.34 The correlation coefficient is in general only calculated for a table 
with a considerable number of rows and columns, such as those given in 
Chapter 9. In some cases, however, a theoretical value is obtainable 
for the coefficient, which holds good even for the limiting case when 
there are only two values possible for each variable (e.g. 0 and 1) and 
consequently two rows and two columns (cf. Exercises 11.5 and 11.6).
It is therefore of some interest to obtain an expression for the coefficient 
in this case in terms of the class-frequencies. 

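Using the notation of Chapters 1-3 the table may be written in the form:

Values of            Values of first variable
second variable      $X_1$        $X_2$        Total
$Y_1$                $(AB)$       $(\alpha B)$   $(B)$
$Y_2$                $(A\beta)$   $(\alpha\beta)$  $(\beta)$
Total                $(A)$        $(\alpha)$     $N$

Taking the centre of the table as arbitrary origin and the class-interval, as usual, as the unit, the co-ordinates of the mean are

$$\xi = \tfrac{1}{2}\{(\alpha) - (A)\}/N, \qquad \eta = \tfrac{1}{2}\{(\beta) - (B)\}/N$$

The standard deviations $\sigma_1$, $\sigma_2$ are given by

$$\sigma_1^2 = 0.25 - \xi^2 = (A)(\alpha)/N^2$$

$$\sigma_2^2 = 0.25 - \eta^2 = (B)(\beta)/N^2$$

Finally,

$$r = \frac{1}{N\sigma_1\sigma_2}\left\{\tfrac{1}{4}\left[(AB) + (\alpha\beta) - (\alpha B) - (A\beta)\right] - N\xi\eta\right\}$$

Writing

$$(AB) - (A)(B)/N = \delta$$

(as in Chapter 2) and replacing $\xi$, $\eta$ by their values, this reduces to

$$r\sigma_1\sigma_2 = \delta/N$$

Whence

$$r = \frac{N\delta}{\sqrt{(A)(\alpha)(B)(\beta)}} \qquad (11.25)$$

We may also put this in the form

$$r = \sqrt{\frac{\chi^2}{N}} \qquad (11.26)$$

where $\chi^2$ is the square contingency as defined in 3.8.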

This value of $r$ can be used as a coefficient of association, but, unlike the association coefficient of Chapter 2, which is unity if either $(AB) = (A)$ or $(AB) = (B)$, $r$ only becomes unity if $(AB) = (A) = (B)$. This is the only case in which both frequencies $(\alpha B)$ and $(A\beta)$ can vanish so that $(AB)$ and $(\alpha\beta)$ correspond to the frequencies of two points, $X_1Y_1$, $X_2Y_2$, on a line. Obviously this alone renders the numerical values of the two coefficients quite incomparable with each other. But further, while the association coefficient is the same for all tables derived from one another by multiplying rows or columns by arbitrary coefficients, the correlation coefficient (11.25) is greatest when $(A) = (\alpha)$ and $(B) = (\beta)$, i.e. when the table is symmetrical, and its value is lowered when the symmetrical table is rendered asymmetrical by increasing or reducing the number of A's or B's. For moderate degrees of association, the association coefficient gives much the larger values. The two coefficients possess, in fact, essentially different properties, and are different measures of association in the same sense that the geometric and arithmetic means are different forms of average, or the semi-interquartile range and the standard deviation different measures of dispersion.
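The contrast between the two coefficients can be seen numerically. The sketch below assumes the formula $r = N\delta/\sqrt{(A)(\alpha)(B)(\beta)}$ with $\delta = (AB) - (A)(B)/N$, and takes the association coefficient of Chapter 2 to be Yule's $Q = \{(AB)(\alpha\beta) - (A\beta)(\alpha B)\}/\{(AB)(\alpha\beta) + (A\beta)(\alpha B)\}$; the function and argument names are illustrative.

```python
from math import sqrt


def fourfold_r(ab, alpha_b, a_beta, alpha_beta):
    """Product-moment r for a 2x2 table from its four class-frequencies."""
    n = ab + alpha_b + a_beta + alpha_beta
    a, alpha = ab + a_beta, alpha_b + alpha_beta   # totals (A), (alpha)
    b, beta = ab + alpha_b, a_beta + alpha_beta    # totals (B), (beta)
    delta = ab - a * b / n
    return n * delta / sqrt(a * alpha * b * beta)


def yule_q(ab, alpha_b, a_beta, alpha_beta):
    """Association coefficient Q (assumed form, see lead-in)."""
    cross_1, cross_2 = ab * alpha_beta, a_beta * alpha_b
    return (cross_1 - cross_2) / (cross_1 + cross_2)


# A table with (alpha B) = 0 but (A beta) > 0: Q is unity while r is not.
print(yule_q(10, 0, 5, 10), round(fourfold_r(10, 0, 5, 10), 3))
```

In this table $(AB) = (B)$ but $(AB) \ne (A)$, which is exactly the situation in which the association coefficient reaches unity and $r$ does not.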

11.35 The student should realise that the product-sum correlation and the tetrachoric correlation are also two entirely different measures with quite different properties. The one is in no sense an approximation to the other, and the two may often differ largely.

Intraclass correlation

11.36 We have previously considered correlations between two definite defined variates, such as age and yield of milk in cows, or stature of father and stature of son; but there occurs, mainly in biological studies, a rather different kind of correlation which we will now proceed to discuss.

Suppose we are examining the relationship between the heights of 
brothers, and consider a pair of brothers. Our two variates will be (1) 
the height of the first brother, and (2) the height of the second brother. 
The question is, which are we to regard as the first brother and which as 
the second? It is not difficult to lay down rules which would enable us to make a distinction; for instance, we might take the elder brother first, or the taller brother first. But if we did this and drew up a correlation table for all such pairs, we should not be answering the question as to the relation between brothers in general, for we should only get a correlation between the height of taller brothers and that of shorter brothers, or the height of elder brothers and the height of younger brothers.

11.37 The relationship of brotherhood is in fact symmetrical; if A is the brother of B, then B is the brother of A. When we are considering only the relationship in height implied by relationship of blood, there is no relevant character to enable us to single out one brother as the first.

We accordingly treat the problem by taking each pair of brothers in two ways: (1) with the height of A as the first variate and that of B as the second, and (2) with the height of B as the first variate and that of A as the second. Similarly, if there are $k$ brothers in the family, we enter in the correlation table the results of taking pairs in all possible ways, which number $k(k-1)$. For example, if we have a family containing three brothers with heights 5 ft. 9 in., 5 ft. 10 in. and 5 ft. 11 in., they may be regarded as giving six pairs of variate values:

5 ft. 9 in. with 5 ft. 10 in.      5 ft. 10 in. with 5 ft. 9 in.
5 ft. 9 in. with 5 ft. 11 in.      5 ft. 11 in. with 5 ft. 9 in.
5 ft. 10 in. with 5 ft. 11 in.     5 ft. 11 in. with 5 ft. 10 in.

11.38 Generally, if we have $n$ families, each with $k$ members, there will be $nk(k-1)$ pairs, and hence the same number of entries in the table.

Such a table is called an intraclass correlation table, and the correlation between the two variates is called intraclass correlation.

Tables in which all the families have the same number are of particular importance, and we will consider them first. It is, however, permissible to apply the term intraclass correlation to the symmetrical table derived from families which have different numbers of members. This case we shall consider in 11.42.

11.39 The intraclass correlation table has certain peculiarities, and is not of such a general type as the ordinary table which we have considered hitherto (and which, for the purposes of distinction, is sometimes called an interclass table).

Let the variate values in the first family be

$$x_{11},\ x_{12},\ \ldots,\ x_{1k}$$

those in the second family being

$$x_{21},\ x_{22},\ \ldots,\ x_{2k}$$

and so on, those in the $n$th family being

$$x_{n1},\ x_{n2},\ \ldots,\ x_{nk}$$

Consider the mean of the X-variate.

In the table the value $x_{11}$ will be associated as an X-variate with each of the $(k-1)$ values $x_{12}, \ldots, x_{1k}$. Hence it appears $(k-1)$ times. Similarly, every other value appears $(k-1)$ times. Hence the sum of the marginal row, corresponding to the X-variate, is $(k-1)\Sigma(x)$, the summation extending over all values. But there are $nk(k-1)$ members in the table. Hence,

$$\bar{X} = \frac{1}{nk}\Sigma(x) \qquad (11.27)$$

Similarly,

$$\bar{Y} = \frac{1}{nk}\Sigma(x) \qquad (11.28)$$

i.e. the means of the variates are the same. This must evidently be the case, for the table is symmetrical.

For the variance of X we have

$$\sigma_X^2 = \frac{1}{nk(k-1)}\Sigma(x - \bar{X})^2$$

the summation here extending over the $nk(k-1)$ entries of the table, and since each $x - \bar{X}$ occurs $(k-1)$ times,

$$\sigma_X^2 = \frac{1}{nk}\Sigma(x - \bar{X})^2 \qquad (11.29)$$

the summation, as before, extending over all the values of $x$.

Similarly,

$$\sigma_Y^2 = \frac{1}{nk}\Sigma(x - \bar{X})^2$$

We therefore write

$$\sigma_X = \sigma_Y = \sigma$$
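11.40 For the correlation coefficient $r$ we have

$$r = \frac{1}{nk(k-1)\sigma^2}\Sigma'(x_{ij} - \bar{X})(x_{il} - \bar{X}) \qquad (11.30)$$

where the summation $\Sigma'$ extends over all the possible pairs.

We can put this formula into a much simpler form.

Consider the terms in (11.30) for which the first term is $(x_{11} - \bar{X})$. They will be the $(k-1)$ terms of the following series:

$$(x_{11} - \bar{X})(x_{12} - \bar{X}) + (x_{11} - \bar{X})(x_{13} - \bar{X}) + \ldots + (x_{11} - \bar{X})(x_{1k} - \bar{X})$$

$$= (x_{11} - \bar{X})\{(x_{12} + x_{13} + \ldots + x_{1k}) - (k-1)\bar{X}\}$$

Now write

$$\bar{X}_1 = \frac{1}{k}(x_{11} + x_{12} + \ldots + x_{1k}) \qquad (11.31)$$

i.e. $\bar{X}_1$ is the mean of the members of the first family. Then our expression becomes

$$(x_{11} - \bar{X})\{k\bar{X}_1 - x_{11} - (k-1)\bar{X}\}$$

$$= (x_{11} - \bar{X})\{k(\bar{X}_1 - \bar{X}) + \bar{X} - x_{11}\}$$

$$= k(\bar{X}_1 - \bar{X})(x_{11} - \bar{X}) - (x_{11} - \bar{X})^2$$

The sum $\Sigma'$ of (11.30) will contain $nk$ such terms.

Hence,

$$nk(k-1)\sigma^2 r = k\Sigma(\bar{X}_i - \bar{X})(x_{ij} - \bar{X}) - \Sigma(x_{ij} - \bar{X})^2 \qquad (11.32)$$

the summation extending over all the $nk$ members.

Now,

$$k\Sigma(\bar{X}_i - \bar{X})(x_{ij} - \bar{X}) = \text{sum of } n \text{ terms like } k^2(\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X}) = k^2\Sigma''(\bar{X}_i - \bar{X})^2$$

$\Sigma''$ extending over the $n$ families; and

$$\Sigma(x_{ij} - \bar{X})^2 = nk\sigma^2$$

Hence, from (11.32),

$$nk(k-1)\sigma^2 r = k^2\Sigma''(\bar{X}_i - \bar{X})^2 - \sigma^2 nk$$

Now $\frac{1}{n}\Sigma''(\bar{X}_i - \bar{X})^2$ is the variance of the means of families about the mean of the whole. Calling this $\sigma_m^2$, we have

$$nk(k-1)\sigma^2 r = k^2 n\sigma_m^2 - \sigma^2 nk$$

or

$$\{1 + r(k-1)\}\sigma^2 = k\sigma_m^2 \qquad (11.33)$$

This result gives us the intraclass correlation in terms of the variance of the distribution (according to either variate) and the variance of the means of families.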

Example 11.6. In five families of 3 the heights of brothers are: 5' 9", 5' 10", 5' 11"; 5' 10", 5' 11", 6' 0"; 5' 11", 6' 0", 6' 1"; 6' 0", 6' 1", 6' 2"; 6' 1", 6' 2", 6' 3". Find the intraclass coefficient of correlation.

Here the mean of the whole $= 6'$.

$$\sigma^2 = \tfrac{1}{15}\{9+4+1+4+1+0+1+0+1+0+1+4+1+4+9\} = \tfrac{8}{3}$$

$$\sigma_m^2 = \tfrac{1}{5}\{4+1+0+1+4\} = 2$$

Hence, from (11.33),

$$\{1 + 2r\}\tfrac{8}{3} = 3 \times 2$$

$$1 + 2r = 2.25$$

$$r = +0.625$$
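Example 11.6 can be verified in two ways: from the two variances, and by forming the symmetrical table of $nk(k-1)$ pairs and computing the ordinary product-moment coefficient. The following is a sketch only; heights are expressed in inches and the function names are illustrative, not from the text.

```python
from math import sqrt

# Five families of three brothers, heights in inches (5' 9" = 69, ...).
heights = [[69, 70, 71], [70, 71, 72], [71, 72, 73], [72, 73, 74], [73, 74, 75]]


def intraclass_r_from_variances(families):
    """Solve {1 + r(k-1)} sigma^2 = k sigma_m^2 for r (equal family sizes)."""
    n, k = len(families), len(families[0])
    values = [x for fam in families for x in fam]
    mean = sum(values) / (n * k)
    var = sum((x - mean) ** 2 for x in values) / (n * k)
    var_means = sum((sum(fam) / k - mean) ** 2 for fam in families) / n
    return (k * var_means / var - 1) / (k - 1)


def intraclass_r_from_pairs(families):
    """Product-moment r over every ordered within-family pair; also returns the pair count."""
    pairs = [(a, b) for fam in families
             for i, a in enumerate(fam) for j, b in enumerate(fam) if i != j]
    m = len(pairs)
    mean_x = sum(a for a, _ in pairs) / m
    mean_y = sum(b for _, b in pairs) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / m
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs) / m
    var_y = sum((b - mean_y) ** 2 for _, b in pairs) / m
    return cov / sqrt(var_x * var_y), m


print(intraclass_r_from_variances(heights), intraclass_r_from_pairs(heights))
```

The agreement of the two routes is what the reduction from (11.30) to (11.33) asserts.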

11.41 We may notice two rather unusual results which follow from equation (11.33).

In the first place, since $\sigma_m^2$ is not negative,

$$1 + r(k-1) \ge 0$$

and hence,

$$r \ge -\frac{1}{k-1}$$

Thus, whereas the interclass correlation coefficient can vary from $-1$ to $+1$, the intraclass coefficient cannot be less than $-\frac{1}{k-1}$. For example, in families of threes the intraclass coefficient cannot be less than $-\tfrac{1}{2}$.

Secondly, let us consider the correlation within a single family, i.e. when $n = 1$.

In this case, $\sigma_m^2 = 0$, and hence

$$r = -\frac{1}{k-1}$$

For $k = 2, 3, 4, \ldots$ this gives the successive values of $r = -1, -\tfrac{1}{2}, -\tfrac{1}{3}, \ldots$ It is clear that the first value is correct, for the two values $x_1$ and $x_2$ determine only two points $(x_1, x_2)$ and $(x_2, x_1)$, and the slope of the line joining them is negative.

The student should notice that a corresponding negative association will arise between the first and second members of the pair if all pairs are chosen from a population in which the variates can assume only two values, say 0 and 1, or in which only A's and not-A's are distinguished. We use this result later in 17.36.
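The single-family result can be checked directly by forming all ordered pairs within one family and computing the ordinary product-moment coefficient. This is a minimal sketch; the function name `pair_table_r` and the sample values are illustrative, and the family must contain at least two distinct values.

```python
from math import sqrt


def pair_table_r(family):
    """Product-moment r over all ordered pairs (x_i, x_j), i != j, of one family."""
    pairs = [(a, b) for i, a in enumerate(family)
             for j, b in enumerate(family) if i != j]
    m = len(pairs)
    mean_x = sum(a for a, _ in pairs) / m
    mean_y = sum(b for _, b in pairs) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / m
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs) / m
    var_y = sum((b - mean_y) ** 2 for _, b in pairs) / m
    return cov / sqrt(var_x * var_y)


# k = 2, 3, 4 members: the coefficient depends on k only, not on the values.
for family in ([1.0, 4.0], [2.0, 3.0, 7.0], [1.0, 2.0, 4.0, 8.0]):
    print(len(family), pair_table_r(family))
```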

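11.42 Reverting now to the more general case, suppose we have $n$ families whose members number $k_1, k_2, \ldots, k_n$.

The $i$th family contributes $k_i(k_i - 1)$ pairs to the intraclass table, and hence the total number of pairs is $\Sigma\{k_i(k_i - 1)\} = N$, say, the summation extending over the $n$ families.

Let the variate values be

$$x_{11},\ x_{12},\ \ldots,\ x_{1k_1}$$

$$x_{21},\ x_{22},\ \ldots,\ x_{2k_2}$$

$$\ldots$$

$$x_{n1},\ x_{n2},\ \ldots,\ x_{nk_n}$$

As in 11.41, we see that in the intraclass table each member of the first family appears $(k_1 - 1)$ times, each of the second $(k_2 - 1)$ times, and so on. Hence,

$$\bar{X} = \bar{Y} = \frac{1}{N}\Sigma\{(k_i - 1)\Sigma'(x_{ij})\} \qquad (11.34)$$

the summation $\Sigma'$ being carried over all members of the $i$th family and $\Sigma$ over all families.

Similarly,

$$\sigma_X^2 = \sigma_Y^2 = \sigma^2 = \frac{1}{N}\Sigma\{(k_i - 1)\Sigma'(x_{ij} - \bar{X})^2\} \qquad (11.35)$$

and

$$r = \frac{1}{N\sigma^2}\Sigma''(x_{ij} - \bar{X})(x_{il} - \bar{X})$$

the summation $\Sigma''$ extending over all possible pairs, and this, as in 11.40, reduces to

$$N\sigma^2 r = \Sigma\{k_i^2(\bar{X}_i - \bar{X})^2\} - \Sigma\{\Sigma'(x_{ij} - \bar{X})^2\} \qquad (11.36)$$

These formulae are considerably more complex than those of 11.40, but reduce to those forms if $k$ is constant for all families.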
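A sketch of the general case follows, assuming the weighted mean, the weighted variance and the reduced numerator $\Sigma\{k_i^2(\bar{X}_i - \bar{X})^2\} - \Sigma\Sigma'(x_{ij} - \bar{X})^2$ described above. The function names and the sample data are illustrative only; the direct pair-table computation is included as a cross-check.

```python
from math import sqrt


def intraclass_r_unequal(families):
    """Intraclass r for families of differing sizes via the reduced formula."""
    big_n = sum(len(f) * (len(f) - 1) for f in families)          # number of pairs N
    mean = sum((len(f) - 1) * sum(f) for f in families) / big_n    # weighted mean
    var = sum((len(f) - 1) * sum((x - mean) ** 2 for x in f)
              for f in families) / big_n                           # weighted variance
    numerator = sum(len(f) ** 2 * (sum(f) / len(f) - mean) ** 2
                    - sum((x - mean) ** 2 for x in f) for f in families)
    return numerator / (big_n * var)


def pair_table_r(families):
    """Cross-check: product-moment r over every ordered within-family pair."""
    pairs = [(a, b) for f in families
             for i, a in enumerate(f) for j, b in enumerate(f) if i != j]
    m = len(pairs)
    mx = sum(a for a, _ in pairs) / m
    my = sum(b for _, b in pairs) / m
    cov = sum((a - mx) * (b - my) for a, b in pairs) / m
    vx = sum((a - mx) ** 2 for a, _ in pairs) / m
    vy = sum((b - my) ** 2 for _, b in pairs) / m
    return cov / sqrt(vx * vy)


sample = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 5.0, 7.0, 9.0]]  # made-up values
print(intraclass_r_unequal(sample), pair_table_r(sample))
```

When every family has the same number of members the function reproduces the equal-size result, for instance $0.625$ for five families of three whose members differ by unit steps as in Example 11.6.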


SUMMARY 

1. In cases where the data are incomplete, or in order to avoid lengthy 
calculation, it is possible to use various methods of approximating to the 
product-moment coefficient of correlation, provided that the regression is
approximately linear.

2. Cases in which the regression is non-linear can sometimes be reduced
to the linear case by a suitable transformation of the variates.





3. The correlation ratio of X on Y is given by

$$\eta_{XY}^2 = 1 - \frac{\sigma_{aX}^2}{\sigma_X^2} = \frac{\sigma_{mX}^2}{\sigma_X^2}$$

where $\sigma_X^2$ is the variance of X, $\sigma_{aX}^2$ is the weighted average of the variances of arrays and $\sigma_{mX}^2$ the variance of the means of X-arrays, weighted according to the number of individuals in the arrays.

4. $\eta^2 - r^2$ cannot be negative, and if it is zero the regression of X on Y is linear.

5. Spearman's rank correlation coefficient is given by

$$\rho = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}}$$

where $x$ and $y$ are the deviations of the ranks X and Y from the mean $\tfrac{1}{2}(n+1)$.

6. If $d_k = (X_k - Y_k)$

$$\rho = 1 - \frac{6\Sigma(d^2)}{n^3 - n}$$

7. The rank correlation coefficient $\tau$ is given by

$$\tau = \frac{S}{\tfrac{1}{2}n(n-1)} = \frac{2R}{\tfrac{1}{2}n(n-1)} - 1$$

where $S$ is the sum of scores obtained by allocating $+1$ if pairs of ranks are in the same order in the two rankings and $-1$ in the contrary case; and $R$ is the sum of scores for positive scores only.

8. The coefficient of intraclass correlation is given by

$$\{1 + r(k-1)\}\sigma^2 = k\sigma_m^2$$

where $\sigma$ is the standard deviation of X and Y, and $\sigma_m$ is the standard deviation of the means of families, there being $n$ families each of $k$ members.


EXERCISES 

11.1 Find to 3 places of decimals the correlation ratio of X on Y and of Y on X for the distribution of cows of Table 9.4, page 204 ($r = +0.219$). Hence, show that

$$\eta_{XY}^2 - r^2 = 0.011$$

$$\eta_{YX}^2 - r^2 = 0.023$$

11.2 Find the correlation ratios of the distribution of marriages of Table 9.2.

11.3 In a test of ability to distinguish shades of colour, 15 discs of 
various shades, whose true orders are 1, 2, . . . 15, are arranged by a subject 
in the order 7, 4, 2, 3, 1, 10, 6, 8, 9, 5, 11, 15, 14, 12, 13. Find the rank 
correlation coefficients $\rho$ and $\tau$ between the real and the observed ranks.

11.4 Ten competitors in a beauty contest are ranked by three judges 
in the orders 

1, 6, 5, 10, 3, 2, 4, 9, 7, 8 
3, 5, 8, 4, 7, 10, 2, 1, 6, 9 
6, 4, 9, 8, 1, 2, 3, 10, 5, 7

Use rank correlation coefficients to discuss which pair of judges has the 
nearest approach to common tastes in beauty. 

11.5 (Cf. Pearson, "On a Generalised Theory of Alternative Inheritance," Phil. Trans., A, 1904, 203, 53.) If we consider the correlation between number of recessive couplets in parent and in offspring, in a Mendelian population breeding at random (such as would ultimately result from an initial cross between a pure dominant and a pure recessive), the correlation is found to be 1/3 for a total number of couplets $n$. If $n = 1$, the only possible numbers of recessive couplets are 0 and 1, and the correlation table between parent and offspring reduces to the form

                  Parent
Offspring       0       1       Total
0               5       1       6
1               1       1       2
Total           6       2       8

Verify the correlation, and work out the association coefficient Q.

11.6 (Cf. the above, and also Snow, Proc. Roy. Soc., B, 1910, 83, 42.) For a similar population the correlation between brothers, assuming a practically infinite size of family, is 5/12. The table is

                     First brother
Second brother      0       1       Total
0                   41      7       48
1                   7       9       16
Total               48      16      64

Verify the correlation, and work out the association coefficient Q.
11.7 Establish equation (11.26).

11.8 Show by drawing a graph that the values of $x$ and $2\sin\frac{\pi x}{6}$ are never very different for the range — and that the greatest difference is about 0.018 (Cf. equation (11.23)).

11.9 Referring to the notation of 11.34, show that we have the following expressions for the regressions in a fourfold table

$$b_1 = \frac{N\delta}{(B)(\beta)} = \frac{(AB)}{(B)} - \frac{(A\beta)}{(\beta)}$$

$$b_2 = \frac{N\delta}{(A)(\alpha)} = \frac{(AB)}{(A)} - \frac{(\alpha B)}{(\alpha)}$$

Verify on the tables of Exercises 11.5 and 11.6.

11.10 In four pea-pods, each containing eight peas, the weights of the 
peas are, in hundredths of a gramme : 43, 46, 48, 42, 50, 45, 45 and 49 ; 
33, 34, 37, 39, 32, 35, 37 and 41 ; 56, 52, 50, 51, 54, 52, 49 and 52 ; 36, 
37, 38, 40, 40, 41, 44 and 44. Find the coefficient of intraclass correlation. 

11.11 (Data from O.H. Latter, Biometrika, 1905, 4, 363.) 

The following table shows the length of cuckoos' eggs fostered by various birds:

                      Length of egg (units ½ millimetre)
Foster parent     40   41   42   43   44   45   46   47   48   49   50   Totals
Robin              1    1    8    3    9   13   20    6   11    2    2     76
Wren               7    5   14    8    9    6    3    2    -    -    -     54
Hedge-sparrow      -    -    2    5   14   13   13    3    5    -    3     58
Totals             8    6   24   16   32   32   36   11   16    2    5    188

Find the coefficient of intraclass correlation, and state how many entries there would be in the intraclass correlation table.

11.12 If t consecutive ranks are replaced by a single tie, show that, for
both ρ and τ, the resulting coefficients are the means of the t! coefficients
obtained by permuting the t original ranks in all possible ways. Show
that this remains true if there are several sets of tied ranks in either
ranking.

correlation

12.1 In Chapters 9 to 11 we developed the theory of the correlation 
between a single pair of variables. But in the case of statistics of 
attributes we found it necessary to proceed from the theory of simple 
association for a single pair of attributes to the theory of association for
several attributes, in order to be able to deal with the complex causation 
characteristic of statistics ; and similarly the student will find it impossible 
to advance very far in the discussion of many problems in correlation 
without some knowledge of the theory of multiple correlation, or correlation 
between several variables. 

For example, in considering the relationship between the number of 
children per family, level of income and age at marriage, it might be 
found that the number of children was negatively correlated with income 
and also with age at marriage ; and the question might arise how far 
the first correlation was affected by the fact that people with higher 
incomes tend to marry later. The question could not at the present 
stage be answered by working out the correlation coefficient between the 
last pair of variables, for we have as yet no guide as to how far a correlation 
between the variables 1 and 2 can be accounted for by correlations between 
1 and 3 and 2 and 3. 

Again, a marked positive correlation might be observed between, say, 
the bulk of a crop and the rainfall during a certain period, and practically 
no correlation between the crop and the accumulated temperature during 
the same period ; and the question might arise whether the last result 
might not be due merely to a negative correlation between rain and 
accumulated temperature, the crop being favourably affected by an 
increase of accumulated temperature if other things were equal, but failing
as a rule to obtain this benefit owing to the concomitant deficiency of rain. 
In the problem of inheritance in a population, the corresponding problem 
is of great importance, as already indicated in Chapter 2. It is essential 
for the discussion of possible hypotheses to know whether an observed 
correlation between, say, grandson and grandparent can or cannot be
accounted for solely by observed correlations between grandson and
parent, parent and grandparent.

Partial regressions and correlation coefficients

12.2 Problems of this type, in which it is necessary to consider
simultaneously the relations between at least three variables, and possibly
more, may be treated by a simple and natural extension of the method
used in the case of two variables. The latter case was discussed by forming
linear equations between the two variables, assigning such values
to the constants as to make the sum of the squares of the errors of estimate
as low as possible: the more complicated case may be discussed by
forming linear equations between any one of the n variables involved,
taking each in turn, and the n-1 others, again assigning such values to
the constants as to make the sum of the squares of the errors of estimate
a minimum. If the variables are $X_1, X_2, \ldots X_n$, the equation will
be of the form

$$X_1 = a + b_2X_2 + b_3X_3 + \ldots + b_nX_n$$

If in such a generalised regression equation we find a sensible positive
value for any one coefficient such as $b_2$ we know that there must be a
positive correlation between $X_1$ and $X_2$ that cannot be accounted for by
mere correlations of $X_1$ and $X_2$ with $X_3$, $X_4$ or $X_n$, for the effects of
changes in these variables are allowed for in the remaining terms on the
right. The magnitude of $b_2$ gives, in fact, the mean change in $X_1$
associated with a unit change in $X_2$ when all the remaining variables are
kept constant.

The correlation between $X_1$ and $X_2$ indicated by $b_2$ may be termed
a partial correlation, as corresponding with the partial association of
Chapter 2, and it is required to deduce from the values of the coefficients
$b$, which may be termed partial regressions, partial coefficients of correlation
giving the correlation between $X_1$ and $X_2$ or other pair of variables when
the remaining variables $X_3 \ldots X_n$ are kept constant, or when changes
in these variables are corrected or allowed for, so far as this may be done
with a linear equation. For examples of such generalised regression
equations the student may turn to the illustrations worked out later
in this chapter.

12.3 With this explanatory introduction, we may now proceed to the
algebraic theory of such generalised regression equations and of multiple
correlation in general. It will first, however, be as well to revert briefly
to the case of two variables. In Chapter 9, to obtain the greatest possible
simplicity of treatment, the value of the coefficient $r_{12}$ was deduced
on the special assumption that the means of all arrays were strictly
collinear, and the meaning of the coefficient in the more general case was
subsequently investigated. Such a process is not conveniently applicable
when a number of variables are to be taken into account, and the problem
has to be faced directly: i.e. required, to determine the coefficients and
constant term, if any, in a regression equation, so as to make the sum of
the squares of the errors of estimate a minimum.

12.4 To solve this problem we proceed as follows.
Let us measure the variates $X_1 \ldots X_n$ from their respective means,
denoting the quantities so obtained by $x_1 \ldots x_n$.

Then the regression equation of, say, $x_1$ on $x_2 \ldots x_n$ may be written
in the form

$$x_1 = a_1 + b_2x_2 + \ldots + b_nx_n$$

We have to find $a_1, b_2, \ldots b_n$ such that

$$E_1 = \Sigma(x_1 - a_1 - b_2x_2 - \ldots - b_nx_n)^2$$

is a minimum, the summation taking place over all sets of values of
$x_1 \ldots x_n$.

Now,

$$E_1 = \Sigma(a_1^2) + \Sigma(x_1 - b_2x_2 - \ldots - b_nx_n)^2$$

the product term

$$\Sigma\{a_1(x_1 - b_2x_2 - \ldots - b_nx_n)\}$$

vanishing, since $x_1$, etc. are measured from the mean.

Hence we have, for the minimum value of $E_1$,

$$a_1 = 0$$

Now, if $b_2$ is chosen so that $E_1$ is a minimum, the value of $E_1$, when
$(b_2 + \delta)$ is substituted for $b_2$, is increased no matter how small $\delta$
may be; i.e.

$$\Sigma\{x_1 - (b_2 + \delta)x_2 - \ldots - b_nx_n\}^2 \ge \Sigma(x_1 - b_2x_2 - \ldots - b_nx_n)^2$$

Expanding the left-hand side, and neglecting $\delta^2$, which can be made as
small as we please compared with $\delta$,

$$\Sigma(x_1 - b_2x_2 - \ldots - b_nx_n)^2 - 2\delta\,\Sigma\{x_2(x_1 - b_2x_2 - \ldots - b_nx_n)\} \ge \Sigma(x_1 - b_2x_2 - \ldots - b_nx_n)^2$$

or

$$\delta\,\Sigma\{x_2(x_1 - b_2x_2 - \ldots - b_nx_n)\} \le 0$$

Now this is to be true for all small values of $\delta$, positive or negative.
If $\Sigma\{x_2(x_1 - b_2x_2 - \ldots - b_nx_n)\}$ were not zero, this would be
impossible, for if it were positive, say, we could take $\delta$ positive and the
inequality would not be satisfied.

Hence,

$$\Sigma\{x_2(x_1 - b_2x_2 - \ldots - b_nx_n)\} = 0$$

Similarly, considering $b_3$ instead of $b_2$, we have

$$\Sigma\{x_3(x_1 - b_2x_2 - \ldots - b_nx_n)\} = 0$$

and so on, there being $(n-1)$ equations. These are sufficient to determine
the $(n-1)$ quantities $b_2 \ldots b_n$ and hence our problem is solved.
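The argument of 12.4 can be tried numerically. The following is only a minimal sketch for three variables: the data values and the function names are made up for illustration and do not come from the text. It solves the two equations $\Sigma(x_2 e) = 0$ and $\Sigma(x_3 e) = 0$, where $e$ is the error of estimate, and forms the errors so that the least-squares property can be inspected.

```python
# Illustrative sketch: regression of x1 on x2 and x3, all measured from
# their means, by solving the (n - 1) = 2 equations of 12.4.
# The observations below are hypothetical.

def deviations(values):
    """Return the values measured from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def product_sum(u, v):
    """Sum of products of paired values."""
    return sum(a * b for a, b in zip(u, v))

def solve_for_b(x1, x2, x3):
    """Solve S(x2*e) = 0 and S(x3*e) = 0 for b2 and b3,
    where e = x1 - b2*x2 - b3*x3 (Cramer's rule on a 2x2 system)."""
    s22, s33, s23 = product_sum(x2, x2), product_sum(x3, x3), product_sum(x2, x3)
    s12, s13 = product_sum(x1, x2), product_sum(x1, x3)
    det = s22 * s33 - s23 * s23
    b2 = (s12 * s33 - s13 * s23) / det
    b3 = (s13 * s22 - s12 * s23) / det
    return b2, b3

x1 = deviations([2.0, 4.0, 5.0, 4.0, 7.0, 8.0])
x2 = deviations([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = deviations([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

b2, b3 = solve_for_b(x1, x2, x3)
errors = [a - b2 * b - b3 * c for a, b, c in zip(x1, x2, x3)]
print("b2 =", b2, " b3 =", b3)
print("sum of squared errors =", product_sum(errors, errors))
```

Replacing `b2` by `b2 + d` for any small `d`, positive or negative, should give a larger sum of squared errors, which is the property the text uses to derive the equations.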
Notation

12.5 At this point we introduce a flexible notation which will enable
us to consider any regression equation.

We write:

$$x_1 = b_{12.34\ldots n}\,x_2 + b_{13.24\ldots n}\,x_3 + \ldots + b_{1n.23\ldots(n-1)}\,x_n \qquad (12.1)$$

The quantities $b$ are partial regression coefficients. The first subscript
attached to the $b$ is the subscript of the letter on the left (the dependent
variable). The second subscript is that of the $x$ to which it is attached.
These are called primary subscripts.

After the primary subscripts, and separated from them by a point,
are placed the subscripts of the remaining variables on the right. These
are called secondary subscripts.

Equation (12.1) is the regression equation of $x_1$. Similarly, in accordance
with the rules we have just laid down, we have:

$$x_2 = b_{21.34\ldots n}\,x_1 + b_{23.14\ldots n}\,x_3 + \ldots + b_{2n.13\ldots(n-1)}\,x_n$$

and so on.

It should be noted that the order in which the secondary subscripts are
written is immaterial; but this is not true of the primary subscripts; e.g.
$b_{12.3\ldots n}$ and $b_{21.3\ldots n}$ denote quite distinct coefficients, $x_1$ being the
dependent variable in the first case and $x_2$ in the second.

A coefficient with p secondary subscripts may be termed a regression
of the pth order. The regressions $b_{12}$, $b_{21}$, etc., obtained by considering
two variables alone, may be regarded as of order zero, and may
be termed total, as distinct from partial, regressions.

12.6 If the regressions $b_{12.34\ldots n}$, $b_{13.24\ldots n}$, etc., be assigned the
"best" values, as determined by the method of least squares, the difference
between the actual value of $x_1$ and the value assigned by the right-hand
side of the regression equation (12.1), that is, the error of estimate, will be
denoted by $x_{1.23\ldots n}$; i.e. as a definition we have:

$$x_{1.23\ldots n} = x_1 - b_{12.34\ldots n}\,x_2 - b_{13.24\ldots n}\,x_3 - \ldots - b_{1n.23\ldots(n-1)}\,x_n \qquad (12.2)$$

where $x_1, x_2, \ldots$ are assigned any one set of observed values. Such an
error (or residual, as it is sometimes called), denoted by a symbol with p
secondary suffixes, will be termed a deviation of the pth order.

Finally, we will define a generalised standard deviation $\sigma_{1.23\ldots n}$ by
the equation

$$N\sigma^2_{1.23\ldots n} = \Sigma(x^2_{1.23\ldots n}) \qquad (12.3)$$

N being, as usual, the number of observations. A standard deviation
denoted by a symbol with p secondary suffixes will be termed a standard
deviation of the pth order, the standard deviations $\sigma_1$, $\sigma_2$, etc., being
regarded as of order zero, the standard deviations $\sigma_{1.2}$, $\sigma_{2.1}$, etc., of the first
order, and so on.

12.7 In the case of two variables, the correlation coefficient $r_{12}$ may
be regarded as defined by the equation

$$r_{12} = (b_{12}\,b_{21})^{1/2}$$

We shall generalise this equation in the form

$$r_{12.34\ldots n} = (b_{12.34\ldots n}\,b_{21.34\ldots n})^{1/2} \qquad (12.4)$$

This is at present a pure definition of a new symbol, and it remains to be
shown that $r_{12.34\ldots n}$ may really be regarded as, and possesses all the
properties of, a correlation coefficient; the name may, however, be applied
to it, pending the proof. A correlation coefficient with p secondary
subscripts will be termed a correlation of order p. Evidently, in the
case of a correlation coefficient, the order in which both primary and
secondary subscripts are written is indifferent, for the right-hand side of
equation (12.4) is unaltered by writing 2 for 1 and 1 for 2. The correlations
$r_{12}$, $r_{13}$, etc., may be regarded as of order zero, and spoken of as total,
as distinct from partial, correlations.

The normal equations

12.8 All the quantities we have just defined are expressible in terms
of the total and partial regression coefficients, and particular importance
therefore attaches to the equations which give those coefficients. The
equations of 12.4 may be written

$$\Sigma(x_2\,x_{1.23\ldots n}) = 0 \qquad (12.5)$$

etc., there being $(n-1)$ equations for each regression equation.

These equations are called the normal equations.

12.9 If the student will follow the process by which (12.5) was obtained,
he will see that when the condition is expressed that $b_{12.34\ldots n}$ shall
possess the "least-square" value, $x_2$ enters into the product-sum with
$x_{1.23\ldots n}$: when the same condition is expressed for $b_{13.24\ldots n}$, $x_3$ enters
into the product-sum, and so on. Taking each regression in turn, in fact,
every $x$ the suffix of which is included in the secondary suffixes of $x_{1.23\ldots n}$
enters into the product-sum. The normal equations of the form (12.5) are
therefore equivalent to the theorem:

The product-sum of any deviation of order zero with any deviation of higher
order is zero, provided the subscript of the former occur among the secondary
subscripts of the latter.

12.10 But it follows from this that

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma\{x_{1.34\ldots n}(x_2 - b_{23.4\ldots n}x_3 - \ldots - b_{2n.34\ldots(n-1)}x_n)\} = \Sigma(x_{1.34\ldots n}\,x_2)$$

Similarly,

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma(x_1\,x_{2.34\ldots n})$$

Similarly again,

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)}) = \Sigma(x_{1.34\ldots n}\,x_2)$$

and so on. Therefore, quite generally,

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots n}) = \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)}) = \Sigma(x_1\,x_{2.34\ldots n}) = \Sigma(x_{1.34\ldots n}\,x_2) \qquad (12.6)$$

Comparing all the equal product-sums that may be obtained in this way,
we see that the product-sum of any two deviations in which all the secondary
subscripts of the first occur among the secondary subscripts of the second is
unaltered by omitting any or all of the secondary subscripts of the first, and,
conversely, the product-sum of any deviation of order p with a deviation of
order p+q, the p subscripts being the same in each case, is unaltered by adding
to the secondary subscripts of the former any or all of the q additional
subscripts of the latter.

It follows therefore from (12.5) that any product-sum is zero if all the
subscripts of the one deviation occur among the secondary subscripts of the
other. As the simplest case, we may note that $x_1$ is uncorrelated with
$x_{2.1}$ and $x_2$ uncorrelated with $x_{1.2}$.

The theorems of this and of the preceding paragraph are of fundamental
importance, and should be carefully remembered.


12.11 We can now show that the quantities $r$ defined by (12.4) are
really coefficients of correlation. In fact we have, from the results of
12.9 and 12.10,

$$0 = \Sigma(x_{2.34\ldots n}\,x_{1.234\ldots n})$$

$$= \Sigma\{x_{2.34\ldots n}(x_1 - b_{12.34\ldots n}x_2 - \text{terms in } x_3 \text{ to } x_n)\}$$

$$= \Sigma(x_1\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x_2\,x_{2.34\ldots n})$$

$$= \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x^2_{2.34\ldots n})$$

That is,

$$b_{12.34\ldots n} = \frac{\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n})}{\Sigma(x^2_{2.34\ldots n})} \qquad (12.7)$$

But this is the value that would have been obtained by taking a regression
equation of the form

$$x_{1.34\ldots n} = b_{12.34\ldots n}\,x_{2.34\ldots n}$$

and determining $b_{12.34\ldots n}$ by the method of least squares, i.e. $b_{12.34\ldots n}$
is the regression of $x_{1.34\ldots n}$ on $x_{2.34\ldots n}$. It follows at once from
(12.4) that $r_{12.34\ldots n}$ is the correlation between $x_{1.34\ldots n}$ and $x_{2.34\ldots n}$,
and from (12.7) that we may write

$$b_{12.34\ldots n} = r_{12.34\ldots n}\,\frac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}} \qquad (12.8)$$

an equation identical with the familiar relation $b_{12} = r_{12}\,\sigma_1/\sigma_2$ with the
secondary suffixes $34 \ldots n$ added throughout.

To illustrate the meaning of the equation by the simplest case, if we had
three variables only, $x_1$, $x_2$ and $x_3$, the value of $b_{12.3}$ or $r_{12.3}$ could be
determined (1) by finding the correlations $r_{13}$ and $r_{23}$ and the corresponding
regressions $b_{13}$ and $b_{23}$; (2) working out the residuals $x_1 - b_{13}x_3$ and
$x_2 - b_{23}x_3$ for all associated deviations; (3) working out the correlation
between the residuals associated with the same values of $x_3$. The method
would not, however, be a practical one, as the arithmetic would be extremely
lengthy, much more lengthy than the method given below for expressing
a correlation of order p in terms of correlations of order $p-1$.
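The three steps just described are tedious by hand but short by machine. The sketch below, on made-up data with hypothetical function names, forms the residuals $x_1 - b_{13}x_3$ and $x_2 - b_{23}x_3$, correlates them, and compares the result with the three-variable form of equation (12.13) that the chapter derives further on.

```python
import math

# Illustrative sketch: r_12.3 as the correlation between the residuals
# of x1 and x2 after their regressions on x3. The data are hypothetical.

def deviations(values):
    """Return the values measured from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def product_sum(u, v):
    """Sum of products of paired values."""
    return sum(a * b for a, b in zip(u, v))

def correlation(u, v):
    """Correlation of two series already measured from their means."""
    return product_sum(u, v) / math.sqrt(product_sum(u, u) * product_sum(v, v))

def residuals(y, x):
    """Deviation y.x: y minus its least-squares regression on x."""
    b = product_sum(y, x) / product_sum(x, x)
    return [yi - b * xi for yi, xi in zip(y, x)]

x1 = deviations([2.0, 4.0, 5.0, 4.0, 7.0, 8.0])
x2 = deviations([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = deviations([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Steps (1) and (2): regress on x3 and keep the residuals.
x1_3 = residuals(x1, x3)
x2_3 = residuals(x2, x3)

# Step (3): correlate the residuals.
r12_3_residual = correlation(x1_3, x2_3)

# The same coefficient from the total correlations alone.
r12, r13, r23 = correlation(x1, x2), correlation(x1, x3), correlation(x2, x3)
r12_3_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(r12_3_residual, r12_3_formula)
```

The two printed values should agree to rounding error, and the product-sum of `x3` with either residual series should vanish, as the theorem of 12.9 requires.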

Expression of standard deviation in terms of standard deviations and
coefficients of lower orders

12.12 Any standard deviation of order p may be expressed in terms of a
standard deviation of order $p-1$ and a correlation of order $p-1$. For,

$$\Sigma(x^2_{1.23\ldots n}) = \Sigma(x_{1.23\ldots(n-1)}\,x_{1.23\ldots n})$$

$$= \Sigma\{x_{1.23\ldots(n-1)}(x_1 - b_{1n.23\ldots(n-1)}x_n - \text{terms in } x_2 \text{ to } x_{n-1})\}$$

$$= \Sigma(x^2_{1.23\ldots(n-1)}) - b_{1n.23\ldots(n-1)}\,\Sigma(x_{1.23\ldots(n-1)}\,x_{n.23\ldots(n-1)})$$

or, dividing through by the number of observations:

$$\sigma^2_{1.23\ldots n} = \sigma^2_{1.23\ldots(n-1)}\,(1 - b_{1n.23\ldots(n-1)}\,b_{n1.23\ldots(n-1)})$$

$$= \sigma^2_{1.23\ldots(n-1)}\,(1 - r^2_{1n.23\ldots(n-1)}) \qquad (12.9)$$

This is again the relation of the familiar form

$$\sigma^2_{1.n} = \sigma^2_1(1 - r^2_{1n})$$

with the secondary suffixes $23 \ldots (n-1)$ added throughout. It is clear
from (12.9) that $r_{1n.23\ldots(n-1)}$, like any correlation of order zero, cannot be
numerically greater than unity. It also follows at once that if we have
been estimating $x_1$ from $x_2, x_3, \ldots x_{n-1}$, $x_n$ will not increase the accuracy
of estimate unless $r_{1n.23\ldots(n-1)}$ (not $r_{1n}$) differ from zero. This condition
is somewhat interesting, as it leads to rather unexpected results. For
example, if $r_{12} = +0.8$, $r_{13} = +0.4$, $r_{23} = +0.5$, it will not be possible to
estimate $x_1$ with any greater accuracy from $x_2$ and $x_3$ than from $x_2$ alone,
for the value of $r_{13.2}$ is zero (see below, 12.15).
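The closing example can be checked in a few lines. This is only a sketch: the helper name `partial` is hypothetical, and it uses the three-variable form of the formula that 12.15 derives below.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c from three total correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# The values quoted in the text.
r12, r13, r23 = 0.8, 0.4, 0.5

# What x3 adds to the estimate of x1 once x2 has been allowed for.
r13_2 = partial(r13, r12, r23)

# Ratios sigma_1.2 / sigma_1 and sigma_1.23 / sigma_1, from (12.9).
ratio_from_x2 = math.sqrt(1 - r12 ** 2)
ratio_from_x2_and_x3 = ratio_from_x2 * math.sqrt(1 - r13_2 ** 2)

print(r13_2, ratio_from_x2, ratio_from_x2_and_x3)
```

Since `r13_2` comes out as zero, the two ratios coincide: bringing in $x_3$ leaves the standard error of estimate of $x_1$ unchanged, even though $r_{13}$ itself is not zero.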
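12.13 It should be noted that, in equation (12.9), any other subscript
can be eliminated in the same way as subscript n from the suffix of
$\sigma_{1.23\ldots n}$, so that a standard deviation of order p can be expressed in p
ways in terms of standard deviations of the next lower order. This is useful
as affording an independent check on arithmetic. Further, $\sigma_{1.23\ldots(n-1)}$
can be expressed in the same way in terms of $\sigma_{1.23\ldots(n-2)}$, and so on, so
that we must have

$$\sigma^2_{1.23\ldots n} = \sigma^2_1(1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \ldots (1 - r^2_{1n.23\ldots(n-1)}) \qquad (12.10)$$

This is an extremely convenient expression for arithmetical use; the
arithmetic can again be subjected to an absolute check by eliminating the
subscripts in a different, say the inverse, order. Apart from the algebraic
proof, it is obvious that the values must be identical; for if we are
estimating one variable from n others, it is clearly indifferent in what
order the latter are taken into account.

$\sigma_{1.23\ldots n}$ can also be expressed in terms of $\sigma_1$ and the total correlation
coefficients. We have

$$\Sigma(x_1\,x_{1.23\ldots n}) = \Sigma(x^2_{1.23\ldots n}) = N\sigma^2_{1.23\ldots n}$$

Hence, expanding $x_{1.23\ldots n}$,

$$\sigma^2_1 - b_{12.3\ldots n}\,r_{12}\sigma_1\sigma_2 - b_{13.2\ldots n}\,r_{13}\sigma_1\sigma_3 - \ldots = \sigma^2_{1.23\ldots n}$$

The $(n-1)$ normal equations involving $x_{1.23\ldots n}$ are

$$\Sigma(x_2\,x_{1.23\ldots n}) = 0, \text{ etc.}$$

i.e. expanding,

$$r_{12}\sigma_1\sigma_2 - b_{12.3\ldots n}\,\sigma^2_2 - b_{13.2\ldots n}\,r_{23}\sigma_2\sigma_3 - \ldots = 0$$

$$r_{13}\sigma_1\sigma_3 - b_{12.3\ldots n}\,r_{23}\sigma_2\sigma_3 - b_{13.2\ldots n}\,\sigma^2_3 - \ldots = 0, \text{ etc.}$$

Regarding the n equations so obtained as equations in the quantities $b$,
we have, on elimination, the determinant

$$\begin{vmatrix}
\sigma^2_1 - \sigma^2_{1.23\ldots n} & r_{12}\sigma_1\sigma_2 & \ldots & r_{1n}\sigma_1\sigma_n \\
r_{21}\sigma_2\sigma_1 & \sigma^2_2 & \ldots & r_{2n}\sigma_2\sigma_n \\
\ldots & \ldots & \ldots & \ldots \\
r_{n1}\sigma_n\sigma_1 & r_{n2}\sigma_n\sigma_2 & \ldots & \sigma^2_n
\end{vmatrix} = 0$$

Dividing the sth row by $\sigma_s$ and the tth column by $\sigma_t$, this gives:

$$\begin{vmatrix}
1 - \sigma^2_{1.23\ldots n}/\sigma^2_1 & r_{12} & \ldots & r_{1n} \\
r_{21} & 1 & \ldots & r_{2n} \\
\ldots & \ldots & \ldots & \ldots \\
r_{n1} & r_{n2} & \ldots & 1
\end{vmatrix} = 0$$

Write $\omega$ for the determinant

$$\omega = \begin{vmatrix}
1 & r_{12} & \ldots & r_{1n} \\
r_{21} & 1 & \ldots & r_{2n} \\
\ldots & \ldots & \ldots & \ldots \\
r_{n1} & r_{n2} & \ldots & 1
\end{vmatrix}$$

and let $\omega_{11}$ be the minor of the term in the first row and column. Then

$$\sigma^2_{1.23\ldots n} = \sigma^2_1\,\frac{\omega}{\omega_{11}}$$

Similarly,

$$\sigma^2_{2.13\ldots n} = \sigma^2_2\,\frac{\omega}{\omega_{22}} \qquad (12.11)$$

and so on.

These results exhibit $\sigma_{1.23\ldots n}$, $\sigma_{2.13\ldots n}$,
etc., in a symmetrical form.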
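The agreement between the product form (12.10) and the determinant form (12.11) can be confirmed numerically. The sketch below is illustrative only: the function names are hypothetical, and the correlations and the value of $\sigma_1$ are those quoted for the three variables of Example 12.1 later in the chapter.

```python
import math

def det(m):
    """Determinant by expansion along the first row (adequate for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

# Total correlations of Example 12.1.
r12, r13, r23 = -0.66, -0.13, 0.60
corr = [[1.0, r12, r13],
        [r12, 1.0, r23],
        [r13, r23, 1.0]]

# (12.11): sigma_1.23 squared over sigma_1 squared equals omega / omega_11.
omega = det(corr)
omega_11 = det([row[1:] for row in corr[1:]])
ratio_det = omega / omega_11

# (12.10): the same ratio as a product of factors (1 - r squared).
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
ratio_chain = (1 - r12 ** 2) * (1 - r13_2 ** 2)

sigma_1 = 1.71  # shillings, as given in Example 12.1
print(ratio_det, ratio_chain, sigma_1 * math.sqrt(ratio_det))
```

Both routes should give the same ratio, and the last printed figure should be close to the value 1.15 found for $\sigma_{1.23}$ in Example 12.1.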


Expression of regression coefficients in terms of coefficients of lower orders

12.14 Any regression of order p may be expressed in terms of regressions
of order $p-1$. For we have:

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots n})$$

$$= \Sigma\{x_{1.34\ldots(n-1)}(x_2 - b_{2n.34\ldots(n-1)}x_n - \text{terms in } x_3 \text{ to } x_{n-1})\}$$

$$= \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots(n-1)}) - b_{2n.34\ldots(n-1)}\,\Sigma(x_{1.34\ldots(n-1)}\,x_{n.34\ldots(n-1)})$$

Replacing $b_{2n.34\ldots(n-1)}$ by $b_{n2.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)}/\sigma^2_{n.34\ldots(n-1)}$,
we have:

$$b_{12.34\ldots n}\,\sigma^2_{2.34\ldots n} = b_{12.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)}$$

or, from (12.9),

$$b_{12.34\ldots n} = \frac{b_{12.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}{1 - b_{2n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}} \qquad (12.12)$$

The student should note that this is an expression of the form

$$b_{12.n} = \frac{b_{12} - b_{1n}\,b_{n2}}{1 - b_{2n}\,b_{n2}}$$

with the subscripts $34 \ldots (n-1)$ added throughout. The coefficient
$b_{12.34\ldots n}$ may therefore be regarded as determined from a regression
equation of the form

$$x_{1.34\ldots(n-1)} = b_{12.34\ldots n}\,x_{2.34\ldots(n-1)} + b_{1n.23\ldots(n-1)}\,x_{n.34\ldots(n-1)}$$

i.e. it is the partial regression of $x_{1.34\ldots(n-1)}$ on $x_{2.34\ldots(n-1)}$,
$x_{n.34\ldots(n-1)}$ being given. As any other secondary suffix might have
been eliminated in lieu of n, we might also regard it as the partial
regression of $x_{1.45\ldots n}$ on $x_{2.45\ldots n}$, $x_{3.45\ldots n}$ being given, and so on.

Expression of correlation coefficient in terms of coefficients of lower
orders

12.15 From equation (12.12) we may readily obtain a corresponding
equation for correlations. For (12.12) may be written:

$$b_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{1 - r^2_{2n.34\ldots(n-1)}}\cdot\frac{\sigma_{1.34\ldots(n-1)}}{\sigma_{2.34\ldots(n-1)}}$$

Hence, writing down the corresponding expression for $b_{21.34\ldots n}$ and
taking the square root:

$$r_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{(1 - r^2_{1n.34\ldots(n-1)})^{1/2}\,(1 - r^2_{2n.34\ldots(n-1)})^{1/2}} \qquad (12.13)$$

This is, similarly, the expression for three variables:

$$r_{12.n} = \frac{r_{12} - r_{1n}\,r_{2n}}{(1 - r^2_{1n})^{1/2}\,(1 - r^2_{2n})^{1/2}}$$

with the secondary subscripts added throughout, and $r_{12.34\ldots n}$ can be
assigned interpretations corresponding to those of $b_{12.34\ldots n}$ above.
Evidently equation (12.13) permits of an absolute check on the arithmetic
in the calculation of all partial coefficients of an order higher than the
first, for any one of the secondary suffixes of $r_{12.34\ldots n}$ can be eliminated
so as to obtain another equation of the same form as (12.13), and the
value obtained for $r_{12.34\ldots n}$ by inserting the values of the coefficients
of lower order in the expression on the right must be the same in each case.
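The check described here can be illustrated for four variables. In the sketch below the six total correlations are hypothetical values chosen only for illustration, and `eliminate` is an assumed helper name for one application of (12.13). The coefficient $r_{12.34}$ is reached by two routes, differing in which secondary suffix is brought in last.

```python
import math

def eliminate(r_ab, r_ac, r_bc):
    """One application of (12.13): add the secondary subscript c to r_ab."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical total correlations among four variables (not from the text).
r12, r13, r14, r23, r24, r34 = 0.5, 0.4, 0.3, 0.3, 0.2, 0.1

# Route A: first-order coefficients with secondary subscript 3, then add 4.
r12_3 = eliminate(r12, r13, r23)
r14_3 = eliminate(r14, r13, r34)
r24_3 = eliminate(r24, r23, r34)
route_a = eliminate(r12_3, r14_3, r24_3)

# Route B: first-order coefficients with secondary subscript 4, then add 3.
r12_4 = eliminate(r12, r14, r24)
r13_4 = eliminate(r13, r14, r34)
r23_4 = eliminate(r23, r24, r34)
route_b = eliminate(r12_4, r13_4, r23_4)

print(route_a, route_b)
```

The two routes should agree to rounding error; in hand computation a discrepancy beyond the last retained digit would point to an arithmetical slip.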


Practical procedure 

12.16 The equations now obtained provide all that is necessary for 
the arithmetical solution of problems in multiple correlation. The best 
mode of procedure on the whole, having calculated all the correlations
and standard deviations of order zero, is (1) to calculate the correlations 
of higher order by successive applications of equation (12.13) ; (2) to 
calculate any required standard deviations by equation (12.10) ; (3) to 
calculate any required regressions by equation (12.8) ; the use of equation 
(12.12) for calculating the regressions of successive orders directly from 
one another is comparatively clumsy. We will give two illustrations, 
the first for three and the second for four variables. The introduction of 
more variables does not involve any difference in the form of the arithmetic, 
but rapidly increases the amount. 

Example 12.1. In Exercise 9.2, page 234, we gave some data of (1)
the average earnings of agricultural labourers, (2) the percentage of the
population in receipt of poor law relief, (3) the ratios of the numbers in
receipt of outdoor relief to those relieved in the workhouse, for 38 rural
districts. Required to work out the partial correlations, regressions, etc.,
for these three variables.

Using as our notation $X_1$ = average earnings, $X_2$ = percentage of
population in receipt of relief, $X_3$ = out-relief ratio, the first constants
determined are:

  M1 = 15.9 shillings     σ1 = 1.71 shillings     r12 = -0.66
  M2 = 3.67 per cent      σ2 = 1.29 per cent      r13 = -0.13
  M3 = 5.79               σ3 = 3.09               r23 = +0.60

To obtain the partial correlations, equation (12.13) is used direct in
its simplest form:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r^2_{13})^{1/2}\,(1 - r^2_{23})^{1/2}}$$

The work is best done systematically and the results collected in
tabular form, especially if logarithms are used, as many of the logarithms
occur repeatedly. First, it will be noted that the logarithms of $(1 - r^2)^{1/2}$
occur in all the denominators; these had, accordingly, better be worked
out at once and tabulated (col. 2 of the table below). In column 3 the
product term of the numerator of each partial coefficient is entered, i.e.
the product of the two other coefficients on the remaining lines in column 1;
subtracting this from the coefficient on the same line in column 1, we have
the numerator (col. 4) and can enter its logarithm. The logarithm of the
denominator (col. 6) is obtained at once by adding the two logarithms of
$(1 - r^2)^{1/2}$ on the remaining lines of the table, and subtracting the logarithms
of the denominators from those of the numerators, we have the logarithms
of the correlations of the first order. It is also as well to calculate at
once, for reference in the calculation of standard deviations of the second
order, the values of $\log\sqrt{1 - r^2}$ for the first-order coefficients (col. 9).

  1             2          3         4          5          6          7          8               9
  Correlation   log        Product   Numerator  log        log        Correlation of first order  log
  coefficient   √(1−r²)    term                 numerator  denom.     log        Value           √(1−r²)
  (zero order)
  r12 = -0.66   1̄.87580   -0.0780   -0.5820    1̄.76492   1̄.89938   1̄.86554   r12.3 = -0.73   1̄.83216
  r13 = -0.13   1̄.99629   -0.3960   +0.2660    1̄.42488   1̄.77889   1̄.64599   r13.2 = +0.44   1̄.95267
  r23 = +0.60   1̄.90309   +0.0858   +0.5142    1̄.71113   1̄.87209   1̄.83904   r23.1 = +0.69   1̄.85946
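The first-order correlations of the table can be recomputed without logarithms. The following is a minimal sketch using the three total correlations quoted above; the helper name `partial` is hypothetical.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """Equation (12.13) in its simplest form: r_ab.c from total correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Total correlations of Example 12.1.
r12, r13, r23 = -0.66, -0.13, 0.60

r12_3 = partial(r12, r13, r23)   # earnings and pauperism, out-relief given
r13_2 = partial(r13, r12, r23)   # earnings and out-relief, pauperism given
r23_1 = partial(r23, r12, r13)   # pauperism and out-relief, earnings given

print(round(r12_3, 2), round(r13_2, 2), round(r23_1, 2))
```

To two decimal places the results agree with column 8 of the table.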

Having obtained the correlations, we can now proceed to the regressions.
If we wish to find all the regression equations, we shall have six regressions
to calculate from equations of the form

$$b_{12.3} = r_{12.3}\,\frac{\sigma_{1.3}}{\sigma_{2.3}}$$

These will involve all the six standard deviations of the first order
$\sigma_{1.2}$, $\sigma_{1.3}$, $\sigma_{2.1}$, $\sigma_{2.3}$, $\sigma_{3.1}$, $\sigma_{3.2}$. The standard deviations of the
first order are not in themselves of much interest, but the standard
deviations of the second order are important, as being the standard errors
or root-mean-square errors of estimate made in using the regression
equations of the second order. We may save needless arithmetic, therefore,
by replacing the standard deviations of the first order by those of the
second, omitting the former entirely, and transforming the above equation
for $b_{12.3}$ to the form

$$b_{12.3} = r_{12.3}\,\frac{\sigma_{1.23}}{\sigma_{2.13}}$$

This transformation is a useful one and should be noted by the student.
The values of each $\sigma$ may be calculated twice independently by the formulae
of the form

$$\sigma_{1.23} = \sigma_1(1 - r^2_{12})^{1/2}(1 - r^2_{13.2})^{1/2} = \sigma_1(1 - r^2_{13})^{1/2}(1 - r^2_{12.3})^{1/2}$$

so as to check the arithmetic; the work is rapidly done if the values of
$\log\sqrt{1 - r^2}$ have been tabulated. The values found are:

  log σ1.23 = 0.06146      σ1.23 = 1.15
  log σ2.13 = 1̄.84584     σ2.13 = 0.70
  log σ3.12 = 0.34571      σ3.12 = 2.22

From these and the logarithms of the r's we have:

  log b12.3 = 0.08116,   b12.3 = -1.21      log b13.2 = 1̄.36174,  b13.2 = +0.23
  log b21.3 = 1̄.64993,  b21.3 = -0.45      log b23.1 = 1̄.33917,  b23.1 = +0.22
  log b31.2 = 1̄.93024,  b31.2 = +0.85      log b32.1 = 0.33891,   b32.1 = +2.18

That is, the regression equations are:

  (1)  x1 = -1.21 x2 + 0.23 x3
  (2)  x2 = -0.45 x1 + 0.22 x3
  (3)  x3 = +0.85 x1 + 2.18 x2

or, transferring the origins to zero:

  (1)  Earnings          X1 = +19.0  - 1.21 X2 + 0.23 X3
  (2)  Pauperism         X2 = +9.55  - 0.45 X1 + 0.22 X3
  (3)  Out-relief ratio  X3 = -15.74 + 0.85 X1 + 2.18 X2

The units are throughout one shilling for the earnings $X_1$, 1 per cent for the
pauperism $X_2$ and 1 for the out-relief ratio $X_3$.

Now let us examine the light thrown by these results on the relationship
between the variables.

The first and second regression equations are those of most practical
importance. The argument was once advanced that the giving of out-relief
tended to lower earnings, and the total coefficient ($r_{13} = -0.13$)
between earnings ($X_1$) and out-relief ($X_3$), though very small, does not
seem inconsistent with such a hypothesis. The partial correlation
coefficient ($r_{13.2} = +0.44$) and the regression equation (1), however,
indicate that in unions with a given percentage of the population in receipt
of relief ($X_2$) the earnings were highest where the proportion of out-relief
was highest; and this is, in so far, against the hypothesis of a tendency
to lower wages. It remained possible, of course, that out-relief might
adversely affect the possibility of earning, e.g. by limiting the employment
of the old.

As regards pauperism, the argument might be advanced that the
observed correlation ($r_{23} = +0.60$) between pauperism and out-relief was
in part due to the negative correlation ($r_{13} = -0.13$) between earnings and
out-relief. Such a hypothesis would have little to support it in view of the
smallness and doubtful significance of $r_{13}$, and is definitely contradicted
by the positive partial correlation $r_{23.1} = +0.69$ and the second regression
equation. The third regression equation shows that the proportion of
out-relief was on the whole highest where earnings were highest and
pauperism greatest. It should be noticed, however, that a negative ratio
is clearly impossible, and consequently the relation cannot be strictly
linear; but the third equation gives possible (positive) average ratios for
all the combinations of pauperism and earnings that actually occur.
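The second-order standard deviations and the six regressions of Example 12.1 can be recomputed directly from the constants quoted at the start of the example. This is a sketch under the formulae of the text, with hypothetical helper names; small differences in the last digit from the logarithmic work are to be expected.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def root(r):
    """The factor (1 - r squared) to the power one half."""
    return math.sqrt(1 - r * r)

# Means, standard deviations and total correlations of Example 12.1.
m1, m2, m3 = 15.9, 3.67, 5.79
s1, s2, s3 = 1.71, 1.29, 3.09
r12, r13, r23 = -0.66, -0.13, 0.60

r12_3 = partial(r12, r13, r23)
r13_2 = partial(r13, r12, r23)
r23_1 = partial(r23, r12, r13)

# Second-order standard deviations; sigma_1.23 is found in two ways as a check.
s1_23 = s1 * root(r12) * root(r13_2)
s1_23_check = s1 * root(r13) * root(r12_3)
s2_13 = s2 * root(r12) * root(r23_1)
s3_12 = s3 * root(r13) * root(r23_1)

# Regressions from b_12.3 = r_12.3 * sigma_1.23 / sigma_2.13, and so on.
b12_3 = r12_3 * s1_23 / s2_13
b13_2 = r13_2 * s1_23 / s3_12
b21_3 = r12_3 * s2_13 / s1_23
b23_1 = r23_1 * s2_13 / s3_12
b31_2 = r13_2 * s3_12 / s1_23
b32_1 = r23_1 * s3_12 / s2_13

# Constant term of equation (1) after transferring the origin to zero.
a1 = m1 - b12_3 * m2 - b13_2 * m3

print(round(s1_23, 2), round(s2_13, 2), round(s3_12, 2))
print(round(b12_3, 2), round(b13_2, 2), round(a1, 1))
```

The two evaluations of $\sigma_{1.23}$ agree exactly in principle, which is the arithmetical check recommended in the text.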

Example 12.2 (Four variables). As an illustration of the form of the
work in the case of four variables, we will take a portion of the data from
another investigation into the causation of pauperism.

The variables are the ratios of the values in 1891 to the values in 1881
(taken as 100) of:

1. The percentage of the population in receipt of relief,

2. The ratio of the numbers given outdoor relief to the numbers relieved
in the workhouse,

3. The percentage of the population over 65 years of age,

4. The population itself,

in the metropolitan group of 32 unions, and the fundamental constants
(means, standard deviations and correlations) are as follows:

TABLE 12.1

  1               2                       3                           4
  Means           Standard deviations     Correlation coefficients    log √(1−r²)
  1   104.7       1   29.2                12   +0.52                  1̄.93154
  2    90.6       2   41.7                13   +0.41                  1̄.96003
  3   107.7       3    5.5                14   -0.14                  1̄.99570
  4   111.3       4   23.8                23   +0.49                  1̄.94038
                                          24   +0.23                  1̄.9…
                                          34   +0.25                  1̄.98598

It is seen that the average changes are not great; the percentages of the
population in receipt of relief increased on an average by 4.7 per cent,
the out-relief ratio dropped by 9.4 per cent and the percentage of the
old increased by 7.7 per cent, while the population of the unions rose
on the average by 11.3 per cent. At the same time the standard deviations
of the first, second and fourth variables are very large. As a matter
of fact, while in one union the pauperism decreased by nearly 50 per cent
and in others by 20 per cent, in some there were increases of 60, 80 and
90 per cent; similarly, in the case of the out-relief, in several unions the
ratio was decreased by 40 to 60 per cent, a consistent anti-out-relief
policy having been enforced; in others the ratio was doubled, and more
than doubled. As regards population, the more central districts showed
decreases ranging up to 20 and 25 per cent, the circumferential districts
increases of 45 to 80 per cent. The correlations of order zero are not
large, the changes in the rate of pauperism exhibiting the highest correlation
with changes in the out-relief ratio, slightly less with changes in the
proportion of old and very little with changes in population.

TABLE 12.2

  1                    2             3            4                      5
  Correlation          Product       Numerator    Correlation            log √(1−r²)
  coefficient          term of                    coefficient
  (zero order)         numerator                  (first order)

  12   +0.52           +0.2009       +0.3191      12.3   +0.4013         1̄.96187
  13   +0.41           +0.2548       +0.1552      13.2   +0.2084         1̄.99035
  23   +0.49           +0.2132       +0.2768      23.1   +0.3553         1̄.97070

  12   +0.52           -0.0322       +0.5522      12.4   +0.5731         1̄.91355
  14   -0.14           +0.1196       -0.2596      14.2   -0.3123         1̄.97772
  24   +0.23           -0.0728       +0.3028      24.1   +0.3580         1̄.97022

  13   +0.41           -0.0350       +0.4450      13.4   +0.4642         1̄.94731
  14   -0.14           +0.1025       -0.2425      14.3   -0.2746         1̄.98297
  34   +0.25           -0.0574       +0.3074      34.1   +0.3404         1̄.97326

  23   +0.49           +0.0575       +0.4325      23.4   +0.4590         1̄.94863
  24   +0.23           +0.1225       +0.1075      24.3   +0.1274         1̄.99645
  34   +0.25           +0.1127       +0.1373      34.2   +0.1618         1̄.99424

The correlations of the second order are obtained in two steps. In the 
first place, the six coefficients of order zero are grouped in four sets of three, 
corresponding to the four sets of three variables formed by omitting each 
one of the four variables in turn (Table 12.2, col. 1). Each of these sets 
of three coefficients is then treated in the same manner as in the last 
example, and so the correlations of the first order (Table 12.2, col. 4) are 
obtained. The first-order coefficients are then regrouped in sets of three,
with the same secondary suffix (Table 12.3, col. 1), and these are treated
precisely in the same way as the coefficients of order zero. In this way, it
will be seen, the value of each coefficient of the second order is arrived at in
two ways independently, and so the arithmetic is checked: $r_{12.34}$ occurs in
the first and fourth lines, for instance, $r_{13.24}$ in the second and seventh, and
so on. Of course slight differences may occur in the last digit if a sufficient
number of digits is not retained, and for this reason the intermediate work 
should be carried to a greater degree of accuracy than is necessary in the 
final result ; thus four places of decimals were retained throughout in the 
intermediate work of this example, and three in the final result. If he 
carries out an independent calculation, the student may differ slightly 
from the logarithms given in this and the following work, if more or fewer 
figures are retained. 
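
The two-step procedure, and the check it provides, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the zero-order coefficients are those quoted for this example, the function name `partial` is our own, and the recursion simply applies the three-variable formula repeatedly, removing one secondary suffix at a time.

```python
from functools import lru_cache
from math import sqrt

# Zero-order correlations of the four-variable example, as quoted in the text.
R0 = {
    (1, 2): 0.52, (1, 3): 0.41, (1, 4): -0.14,
    (2, 3): 0.49, (2, 4): 0.23, (3, 4): 0.25,
}

@lru_cache(maxsize=None)
def partial(i, j, given=()):
    """Partial correlation r_{ij.given}; `partial` is a hypothetical helper name.

    The last suffix in `given` is the one eliminated at this step, using
    coefficients that already carry the remaining secondary suffixes.
    """
    if not given:
        return R0[(min(i, j), max(i, j))]
    *rest, k = given
    rest = tuple(rest)
    r_ij = partial(i, j, rest)
    r_ik = partial(i, k, rest)
    r_jk = partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# r_{12.34} reached by two independent routes: from the coefficients with
# secondary suffix 4, and from those with secondary suffix 3.
via_suffix_4 = partial(1, 2, (4, 3))
via_suffix_3 = partial(1, 2, (3, 4))

r13_24 = partial(1, 3, (2, 4))
r14_23 = partial(1, 4, (2, 3))
```

Carried in full floating-point precision the two routes agree exactly; the small discrepancies mentioned above arise only when intermediate figures are rounded.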

TABLE 12.3

          1                  2             3                  4                    5
Correlation coefficient  Product term  Numerator   Correlation coefficient   log sqrt(1 - r^2)
    (first order)        of numerator                  (second order)

12.4     +0.5731           +0.2131      +0.3600      12.34     +0.457          1̄.94901
13.4     +0.4642           +0.2631      +0.2011      13.24     +0.276          1̄.98277
23.4     +0.4590           +0.2660      +0.1930      23.14     +0.266          1̄.98408

12.3     +0.4013           -0.0350      +0.4363      12.34     +0.457
14.3     -0.2746           +0.0511      -0.3257      14.23     -0.359          1̄.97013
24.3     +0.1274           -0.1102      +0.2376      24.13     +0.270          1̄.98359

13.2     +0.2084           -0.0505      +0.2589      13.24     +0.276
14.2     -0.3123           +0.0337      -0.3460      14.23     -0.359
34.2     +0.1618           -0.0651      +0.2269      34.12     +0.244          1̄.98664

23.1     +0.3553           +0.1219      +0.2334      23.14     +0.266
24.1     +0.3580           +0.1209      +0.2371      24.13     +0.270
34.1     +0.3404           +0.1272      +0.2132      34.12     +0.244



Having obtained the correlations, the regressions can be calculated from
the third-order standard deviations by equations of the form (as in the last
example),

$$ b_{12.34} = r_{12.34}\,\frac{\sigma_{1.234}}{\sigma_{2.134}} $$

so the standard deviations of lower orders need not be evaluated. Using
equations of the form

$$ \sigma_{1.234} = \sigma_1 (1 - r_{12}^2)^{1/2} (1 - r_{13.2}^2)^{1/2} (1 - r_{14.23}^2)^{1/2} $$

$$ \phantom{\sigma_{1.234}} = \sigma_1 (1 - r_{14}^2)^{1/2} (1 - r_{13.4}^2)^{1/2} (1 - r_{12.34}^2)^{1/2} $$

we find:

log $\sigma_{1.234}$ = 1.35740     $\sigma_{1.234}$ = 22.8
log $\sigma_{2.134}$ = 1.50597     $\sigma_{2.134}$ = 32.1
log $\sigma_{3.124}$ = 0.65773     $\sigma_{3.124}$ = 4.55
log $\sigma_{4.123}$ = 1.32914     $\sigma_{4.123}$ = 21.3

All the twelve regressions of the second order can be readily calculated,
given these standard deviations and the correlations, but we may confine
ourselves to the equation giving the changes in pauperism ($X_1$) in terms of
other variables as the most important. It will be found to be

$$ x_1 = 0.325\,x_2 + 1.383\,x_3 - 0.383\,x_4 $$

or, transferring the origins and expressing the equation in terms of
percentage ratios,

$$ X_1 = -31.1 + 0.325\,X_2 + 1.383\,X_3 - 0.383\,X_4 $$

or, again, in terms of percentage changes (ratio - 100):

Percentage change in pauperism
    = +1.4 per cent
      +0.325 times the change in out-relief ratio
      +1.383 times the change in proportion of old
      -0.383 times the change in population
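
As a check on the arithmetic, the regression coefficients and the constant term can be recomputed from the rounded figures printed above. The sketch below is illustrative only: the dictionary names are our own, and because it starts from three-figure values it reproduces the printed coefficients only to within rounding.

```python
# Second-order partial correlations r_{12.34}, r_{13.24}, r_{14.23} and
# third-order standard deviations, copied (rounded) from the worked example.
r_partial = {2: 0.457, 3: 0.276, 4: -0.359}
sigma = {1: 22.8, 2: 32.1, 3: 4.55, 4: 21.3}

# b_{1j.(rest)} = r_{1j.(rest)} * sigma_{1.(rest)} / sigma_{j.(rest)}
b = {j: r_partial[j] * sigma[1] / sigma[j] for j in r_partial}

# Moving from percentage ratios X to percentage changes (X - 100):
# X_1 - 100 = -31.1 + 100 * (sum of coefficients) - 100 + sum(b_j * change_j)
printed = {2: 0.325, 3: 1.383, 4: -0.383}
constant_for_changes = -31.1 + 100 * sum(printed.values()) - 100

for j in sorted(b):
    print(f"b_1{j} = {b[j]:+.3f}")
print(f"constant = {constant_for_changes:+.1f} per cent")
```

The last coefficient comes out nearer -0.384 than -0.383 when computed from the rounded inputs, which is the kind of last-digit difference the text warns about.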

These results render the interpretation of the total coefficients, which 
might be equally consistent with several hypotheses, more clear and definite. 
The questions would arise, for instance, whether the correlation of changes 
in pauperism with changes in out-relief might not be due to correlation of
the latter with the other factors introduced, and whether the negative 
correlation with changes in population might not be due solely to the 
correlation of the latter with changes in the proportion of old. As a matter 
of fact, the partial correlations of changes in pauperism with changes in 
out-relief and in proportion of old are slightly less than the total correla- 
tions, but the partial correlation with changes in population is numerically 
greater, the figures being — 

$r_{12}$ = +0.52     $r_{12.34}$ = +0.46

$r_{13}$ = +0.41     $r_{13.24}$ = +0.28

$r_{14}$ = -0.14     $r_{14.23}$ = -0.36

So far, then, as we have taken the factors of the case into account, there 
appears to have been a true correlation between changes in pauperism and 
changes in out-relief, proportion of old and population — the latter serving, 
of course, as some index to changes in general prosperity. The relative
influences of the three factors are indicated by the regression equation 
above. 

In this and the previous example we have had to consider only three 
or four independent variables. For five or more the number of partial 
correlations and regressions increases rapidly (see Exercise 12.6) and it
becomes impracticable to compute them all without great labour.* In such
circumstances, where we are primarily interested in the regression of one
variate on the others, it may well be easier to solve direct the normal
equations given at the end of 12.4, either by progressive elimination of
variables in the usual manner for simultaneous linear equations or by
evaluating determinants systematically. See the comments on this point
in 13.27-13.29. 
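
To illustrate the direct route, the sketch below solves the normal equations for the regression of variable 1 on variables 2, 3 and 4 by progressive elimination. It is a minimal illustration, not the book's own layout: we assume the equations are written in standardised form (each variable divided by its standard deviation), so that the coefficients of the unknowns are the zero-order correlations; the function name `solve` and the variable names are our own.

```python
def solve(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

# Correlations among the independent variables 2, 3, 4 ...
a = [
    [1.00, 0.49, 0.23],
    [0.49, 1.00, 0.25],
    [0.23, 0.25, 1.00],
]
# ... and of each of them with variable 1.
y = [0.52, 0.41, -0.14]

beta = solve(a, y)  # standardised regression coefficients

# Squared multiple correlation of variable 1 on the other three.
R_squared = sum(bj * rj for bj, rj in zip(beta, y))
```

Each standardised coefficient is converted to the ordinary regression coefficient by multiplying by the ratio of the standard deviation of variable 1 to that of the variable concerned.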


Aids to calculation 

12.17 To facilitate the computation of partial correlation and regression
coefficients, various tables of such quantities as $\sqrt{1-r^2}$ and
$1/\sqrt{1-r^2}$ have been prepared.

See, for instance, T. L. Kelley's Statistical Tables,


The generalised scatter diagram 

12.18 The scatter diagram in two dimensions may be generalised to
three dimensions, and may also be used as a mental construct for higher 
dimensions, though no actual model can of course be made. 

Consider the case of three variates. The values of $X_1$, $X_2$ and $X_3$
associated with any given individual may be regarded as determining a
point in space whose co-ordinates are $X_1$, $X_2$ and $X_3$. The totality of
individuals will therefore give us a swarm of points in three-dimensional 
space, which will lie distributed in certain ways about planes of regression. 
The closeness with which the points lie to the regression planes is a 
measure of the adequacy of the representation by regression equations. 
In figure 12.1 we give a diagrammatic representation of the data of 
Example 12.1 with the regression plane of Xj on the other two variables. 



Fig. 12.1. Generalised scatter diagram for three variables.
Data of Example 12.1. $X_1$ = average earnings, $X_2$ = percentage population in
receipt of relief, $X_3$ = out-relief ratio.






Coefficient of multiple correlation

12.19 Consider the regression equation for $x_1$,

$$ x_1 = b_{12.34 \ldots n}\,x_2 + b_{13.24 \ldots n}\,x_3 + \ldots + b_{1n.23 \ldots (n-1)}\,x_n $$

Let us write the right-hand side of this equation as $e_{1.23 \ldots n}$, so that in
virtue of (12.2),

$$ e_{1.23 \ldots n} = x_1 - x_{1.23 \ldots n} \qquad (12.14) $$

Now consider the correlation between $x_1$ and $e_{1.23 \ldots n}$. We have
in virtue of the theorem of 12.10:

$$ S(x_1 e_{1.23 \ldots n}) = S\{x_1 (x_1 - x_{1.23 \ldots n})\} = N(\sigma_1^2 - \sigma_{1.23 \ldots n}^2) $$

Also

$$ S(e_{1.23 \ldots n}^2) = S(x_1 - x_{1.23 \ldots n})^2 = N(\sigma_1^2 - \sigma_{1.23 \ldots n}^2) $$

Hence, the correlation between $x_1$ and $e_{1.23 \ldots n}$ is

$$ \frac{\sigma_1^2 - \sigma_{1.23 \ldots n}^2}{\sigma_1 \sqrt{\sigma_1^2 - \sigma_{1.23 \ldots n}^2}} = \frac{\sqrt{\sigma_1^2 - \sigma_{1.23 \ldots n}^2}}{\sigma_1} $$

We shall call this quantity $R_{1(23 \ldots n)}$. We have immediately:

$$ \sigma_{1.23 \ldots n}^2 = \sigma_1^2 \,(1 - R_{1(23 \ldots n)}^2) \qquad (12.15) $$

$R_{1(23 \ldots n)}$ is called the multiple correlation coefficient between $x_1$ and
$x_2 \ldots x_n$. We have, similarly, multiple correlations between $x_1$ and
fewer variables. $R_{1(23 \ldots n)}$ is called an (n-1)-fold multiple correlation
coefficient. $R_{1(23 \ldots (n-1))}$ would be an (n-2)-fold coefficient, and so on.

12.20 The value of R may be calculated either directly from equation
(12.15), or by substituting in that equation the value of $\sigma_{1.23 \ldots n}^2$ obtained
in (12.10), which gives:

$$ 1 - R_{1(23 \ldots n)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \ldots (1 - r_{1n.23 \ldots (n-1)}^2) \qquad (12.16) $$

Properties of the correlation coefficient

12.21 $R_{1(23 \ldots n)}$, being the correlation between $x_1$ and $e_{1.23 \ldots n}$,
measures how closely $x_1$ can be represented by the regression equation. If
R = 1, $x_1$ can be perfectly represented by such an equation, i.e. is a linear
function of $x_2 \ldots x_n$. In this case $\sigma_{1.23 \ldots n}^2 = 0$, i.e. all the residuals are
zero.

It may, in fact, be shown that $R_{1(23 \ldots n)}$ is greater than the correlation
between $x_1$ and any linear function of $x_2 \ldots x_n$ other than that expressed
in the regression equation, i.e. $e_{1.23 \ldots n}$. Putting this another way, the
regression coefficients in $e_{1.23 \ldots n}$ may be determined by the condition
that the correlation between $x_1$ and $e_{1.23 \ldots n}$ is a maximum.

R is necessarily positive or zero

12.22 This is true, since the product term $S(x_1 e_{1.23 \ldots n})$ is positive,
being equal to $N(\sigma_1^2 - \sigma_{1.23 \ldots n}^2)$, and we see from (12.10) that

$$ \sigma_{1.23 \ldots n} \le \sigma_1 $$

Further, from (12.16),

$$ 1 - R_{1(23 \ldots n)}^2 \le 1 - r_{12}^2 $$

i.e. R is not numerically less than $r_{12}$. Similarly, it is not numerically less
than any other total or partial correlation coefficient which can appear
in (12.16). Hence, $R_{1(23 \ldots n)}$ is not numerically less than any possible
constituent coefficient of correlation.

It follows from this that if $R_{1(23 \ldots n)} = 0$, all the correlation coefficients
involving $X_1$ are zero, i.e. the variate is completely uncorrelated with the
other variates.

12.23 Further, even if all the variables $X_1 \ldots X_n$ were strictly
uncorrelated in the original population as a whole, we should expect $r_{12}$,
$r_{13.2}$, etc. to exhibit values (whether positive or negative) differing
from zero in a limited sample. Hence, R will not tend, on an average
of such samples, to be zero, but will fluctuate round some mean value.
This mean value will be the greater the smaller the number of observations
in the sample, and also the greater the number of variables. When only
a small number of observations is available it is, accordingly, little use to
deal with a large number of variables. As a limiting case, it is evident
that if we deal with n variables and possess only n observations, all the
partial correlations of the highest possible order will be unity. We shall
deal with the question of the significance of an observed value of R in
Chapter 22.

Example 12.3. In Example 12.1 we found:

$$ r_{12} = -0.66 \qquad r_{13.2} = +0.44 $$

Hence, from (12.16),

$$ 1 - R_{1(23)}^2 = \{1 - (0.66)^2\}\{1 - (0.44)^2\} = 0.455 $$

whence

$$ R_{1(23)} = 0.74 $$

Similarly, it will be found that

and

$$ R_{3(12)} = 0.70 $$

The student may verify by inspection that these values are greater than 
the corresponding constituent values. 
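
The calculation in Example 12.3 is short enough to script. The sketch below is a hedged illustration: the function name `multiple_r` is our own, and it does no more than apply the product formula (12.16) to whatever chain of constituent coefficients is supplied.

```python
from math import sqrt

def multiple_r(constituents):
    """Multiple correlation from 1 - R^2 = product of (1 - r^2).

    `constituents` should hold r_12, r_13.2, r_14.23, ... in that chain;
    `multiple_r` is a hypothetical helper name, not the book's notation.
    """
    unexplained = 1.0
    for r in constituents:
        unexplained *= 1.0 - r * r
    return sqrt(1.0 - unexplained)

# Example 12.3: r_12 = -0.66 and r_13.2 = +0.44.
R_1_23 = multiple_r([-0.66, 0.44])
print(round(R_1_23, 2))
```

Because every factor lies between 0 and 1, adding a further constituent coefficient can only lower the unexplained proportion, which is why R cannot be numerically less than any coefficient in the chain.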

Expression of regressions and correlations in terms of coefficients of
higher orders

12.24 It is obvious that as equations (12.12) and (12.13) enable us to
express regressions and correlations of higher orders in terms of those of
lower orders, we must similarly be able to express the coefficients of lower
in terms of those of higher orders. Such expressions are sometimes useful
for theoretical work. Using the same method of expansion as in previous
cases, we have:

$$ 0 = S(x_{1.23 \ldots n}\, x_{2.34 \ldots (n-1)}) $$

$$ \phantom{0} = S(x_1 x_{2.34 \ldots (n-1)}) - b_{12.34 \ldots n}\, S(x_2 x_{2.34 \ldots (n-1)}) - b_{1n.23 \ldots (n-1)}\, S(x_n x_{2.34 \ldots (n-1)}) $$

That is,

$$ b_{12.34 \ldots (n-1)} = b_{12.34 \ldots n} + b_{1n.23 \ldots (n-1)}\, b_{n2.34 \ldots (n-1)} $$

In this equation the coefficient on the left and the last on the right are of
order n-3, the other two of order n-2. We therefore wish to eliminate the
last coefficient on the right. Interchanging the suffixes 1 for n and n for
1, we have:

$$ b_{n2.34 \ldots (n-1)} = b_{n2.13 \ldots (n-1)} + b_{n1.23 \ldots (n-1)}\, b_{12.34 \ldots (n-1)} $$

Substituting this value for $b_{n2.34 \ldots (n-1)}$ in the first equation, we have:

$$ b_{12.34 \ldots (n-1)} = \frac{b_{12.34 \ldots n} + b_{1n.23 \ldots (n-1)}\, b_{n2.13 \ldots (n-1)}}{1 - b_{1n.23 \ldots (n-1)}\, b_{n1.23 \ldots (n-1)}} \qquad (12.17) $$

This is the required equation for the regressions; it is the equation

$$ b_{12} = \frac{b_{12.n} + b_{1n.2}\, b_{n2.1}}{1 - b_{1n.2}\, b_{n1.2}} $$

with secondary suffixes 34 . . . (n-1) added throughout. The corresponding
equation for the correlations is obtained at once by writing down
equation (12.17) for $b_{21.34 \ldots (n-1)}$ and taking the square root of the
product; this gives:

$$ r_{12.34 \ldots (n-1)} = \frac{r_{12.34 \ldots n} + r_{1n.23 \ldots (n-1)}\, r_{2n.13 \ldots (n-1)}}{(1 - r_{1n.23 \ldots (n-1)}^2)^{1/2}\,(1 - r_{2n.13 \ldots (n-1)}^2)^{1/2}} \qquad (12.18) $$

which is similarly the equation

$$ r_{12} = \frac{r_{12.n} + r_{1n.2}\, r_{2n.1}}{(1 - r_{1n.2}^2)^{1/2}\,(1 - r_{2n.1}^2)^{1/2}} $$

with the secondary suffixes 34 . . . (n-1) added throughout.
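
A numerical round trip makes the relation between the two directions concrete. The sketch below is only an illustration under our own naming: `first_order` applies the three-variable form of the partial correlation formula, `zero_order` applies the three-variable form of the reverse formula, and the input values are three of the zero-order coefficients from the pauperism example.

```python
from math import sqrt

def first_order(r_ab, r_ac, r_bc):
    """r_{ab.c} from the three zero-order coefficients."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def zero_order(r_ab_c, r_ac_b, r_bc_a):
    """r_{ab} recovered from the three first-order coefficients."""
    return (r_ab_c + r_ac_b * r_bc_a) / sqrt((1 - r_ac_b ** 2) * (1 - r_bc_a ** 2))

r12, r13, r23 = 0.52, 0.41, 0.49

r12_3 = first_order(r12, r13, r23)
r13_2 = first_order(r13, r12, r23)
r23_1 = first_order(r23, r12, r13)

# Going back up from the partial coefficients returns the total coefficient.
recovered_r12 = zero_order(r12_3, r13_2, r23_1)
```

Note the change of sign in the numerator: the product term is subtracted when passing to a higher order and added when returning to a lower one.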


Conditions of consistence among correlation coefficients

12.25 Equations (12.13) and (12.18) imply that certain limiting inequalities
must hold between the correlation coefficients in the expression on
the right in each case in order that real values (values between ±1) may
be obtained for the correlation coefficient on the left. These inequalities
correspond precisely with those "conditions of consistence" between
class-frequencies with which we dealt in Chapter 1, but we propose to treat
them only briefly here. Writing (12.13) in its simplest form for $r_{12.3}$, we
must have

$$ r_{12.3}^2 = \frac{(r_{12} - r_{13} r_{23})^2}{(1 - r_{13}^2)(1 - r_{23}^2)} \le 1 $$

that is,

$$ r_{12}^2 + r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23} \le 1 \qquad (12.19) $$

if the three r's are consistent with one another. If we take $r_{12}$ and $r_{13}$ as
known, this gives as limits for $r_{23}$:

$$ r_{12} r_{13} \pm \sqrt{1 - r_{12}^2 - r_{13}^2 + r_{12}^2 r_{13}^2} $$

Similarly, writing (12.18) in its simplest form for $r_{12}$ in terms of $r_{12.3}$,
$r_{13.2}$ and $r_{23.1}$, we must have:

$$ r_{12.3}^2 + r_{13.2}^2 + r_{23.1}^2 + 2 r_{12.3} r_{13.2} r_{23.1} \le 1 \qquad (12.20) $$

and therefore, if $r_{12.3}$ and $r_{13.2}$ are given, $r_{23.1}$ must lie between the limits

$$ -r_{12.3} r_{13.2} \pm \sqrt{1 - r_{12.3}^2 - r_{13.2}^2 + r_{12.3}^2 r_{13.2}^2} $$

The following table gives the limits of the third coefficient, in a few
special cases, for the three coefficients of zero order and of the first order
respectively:

           Value of                          Limits of
r_12 or r_12.3    r_13 or r_13.2        r_23        r_23.1

      0                 0                ±1           ±1
     ±1                ±1                +1           -1
     ±1                ∓1                -1           +1
   ±√0.5             ±√0.5             0, +1        0, -1
   ±√0.5             ∓√0.5             0, -1        0, +1

The student should notice that the set of three coefficients of order zero
and value unity are only consistent if either one only, or all three, are
positive, i.e. +1, +1, +1, or -1, -1, +1; but not -1, -1, -1. On the
other hand, the set of three coefficients of the first order and value unity
are only consistent if one only, or all three, are negative; the only
consistent sets are +1, +1, -1 and -1, -1, -1. The values of the two
given r's need to be very high if even the sign of the third can be inferred;
if the two are equal, they must be at least equal to √0.5 or 0.707 . . .
Finally, it may be noted that no two values for the known coefficients ever
permit an inference of the value zero for the third; the fact that 1 and 2,
1 and 3 are uncorrelated, pair and pair, permits no inference of any kind
as to the correlation between 2 and 3, which may lie anywhere between
+1 and -1.
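
The limits tabulated above follow directly from the two inequalities and are easy to compute for any pair of given coefficients. The following is a minimal sketch; the function names are our own, and the square root is written in the factorised form (1 - a²)(1 - b²), which is algebraically the same as the expression under the radical in the text.

```python
from math import sqrt

def limits_r23(r12, r13):
    """Limits for r_23 given r_12 and r_13 (zero-order case)."""
    half_width = sqrt(max(0.0, (1 - r12 ** 2) * (1 - r13 ** 2)))
    centre = r12 * r13
    return centre - half_width, centre + half_width

def limits_r23_1(r12_3, r13_2):
    """Limits for r_23.1 given r_12.3 and r_13.2 (first-order case).

    The centre changes sign relative to the zero-order case.
    """
    low, high = limits_r23(r12_3, r13_2)
    return -high, -low

root_half = sqrt(0.5)
print(limits_r23(0.0, 0.0))
print(limits_r23(root_half, root_half))
print(limits_r23_1(root_half, root_half))
```

With both given coefficients zero the interval is the whole range from -1 to +1, which is the numerical counterpart of the remark that no inference about the third coefficient is then possible.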


Fallacies in the interpretation of correlation coefficients

12.26 We do not think it necessary to add to this chapter a detailed
discussion of the nature of fallacies on which the theory of multiple
correlation throws much light. The general nature of such fallacies is the same
as for the case of attributes, and was discussed fully in Chapter 2. It
suffices to point out the principal sources of fallacy which are suggested
at once by the form of the partial correlation

$$ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{(1 - r_{13}^2)^{1/2}\,(1 - r_{23}^2)^{1/2}} \qquad (a) $$

and from the form of the corresponding expression for $r_{12}$ in terms of the
partial coefficients:

$$ r_{12} = \frac{r_{12.3} + r_{13.2} r_{23.1}}{(1 - r_{13.2}^2)^{1/2}\,(1 - r_{23.1}^2)^{1/2}} \qquad (b) $$

From the form of the numerator of (a) it is evident (1) that even if $r_{12}$ be
zero, $r_{12.3}$ will not be zero unless either $r_{13}$ or $r_{23}$, or both, are zero. If $r_{13}$
and $r_{23}$ are of the same sign, the partial correlation will be negative; if of
opposite sign, positive. Thus the quantity of a crop might appear to be
unaffected, say, by the amount of rainfall during some period preceding
harvest: this might be due merely to a correlation between rain and
low temperature, the partial correlation between crop and rainfall being
positive and important. We may thus easily misinterpret a coefficient of
correlation which is zero. (2) $r_{12.3}$ may be, indeed often is, of opposite
sign to $r_{12}$, and this may lead to still more serious errors of interpretation.

From the form of the numerator of (b), on the other hand, we see that,
conversely, $r_{12}$ will not be zero even though $r_{12.3}$ is zero, unless either
$r_{13.2}$ or $r_{23.1}$ is zero. This corresponds to the theorem of 2.26, and indicates
a source of fallacies similar to those there discussed.
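
A small worked case may help fix the first point. The figures below are hypothetical, chosen by us only so that the arithmetic is simple; they satisfy the condition of consistence, since $0 + 0.36 + 0.36 - 0 \le 1$.

```latex
% Hypothetical values: r_12 = 0, r_13 = r_23 = 0.6
\[
r_{12.3} \;=\; \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}
        \;=\; \frac{0 - (0.6)(0.6)}{\sqrt{(0.64)(0.64)}}
        \;=\; \frac{-0.36}{0.64}
        \;=\; -0.5625
\]
```

A total correlation of zero is here accompanied by a partial correlation of about -0.56, negative because the two remaining coefficients have the same sign.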

12.27 We have seen that $r_{12.3}$ is the correlation between $x_{1.3}$ and $x_{2.3}$, and
that we might determine the value of this partial correlation by drawing
up the actual correlation table for the two residuals in question. Suppose,
however, that instead of drawing up a single table we drew up a series of
tables for values of $x_{1.3}$ and $x_{2.3}$ associated with values of $x_3$ lying within
successive class-intervals of its range. In general, the value of $r_{12.3}$ would
not be the same (or approximately the same) for all such tables, but would
exhibit some systematic change as the value of $x_3$ increased. Hence $r_{12.3}$
should be regarded, in general, as of the nature of an average correlation:
the cases in which it measures the correlation between $x_{1.3}$ and $x_{2.3}$ for
every value of $x_3$ (cf. below 12.31) are probably exceptional. The process
for determining partial associations (cf. Chapter 2) is, it will be remembered,
thorough and complete, as we always obtain the actual tables exhibiting
the association between, say, A and B in the population of C's and the
population of γ's: that two such associations may differ materially is
illustrated by Example 2.9, page 34. It might sometimes serve as a useful
check on partial correlation work to reclassify the observations by the
fundamental methods of Chapter 2.

Multivariate normal correlation 

12.28 The theorems and results of Chapter 10 in regard to normal 
correlation can be extended to the case of n variates, which we have studied 
in this chapter. 

In fact, suppose we have n variates $x_1, x_2, x_3, \ldots x_n$, measured from
their respective means, with standard deviations $\sigma_1, \sigma_2, \sigma_3, \ldots \sigma_n$. Let
us first consider the simple case in which they are normally distributed
and each is completely independent of the others.

Then, if $y_{12 \ldots n}$ denote the frequency of the combination of deviations
$x_1, x_2, \ldots x_n$, we have:

$$ y_{12 \ldots n} = y'_{12 \ldots n}\, e^{-\frac{1}{2}\phi(x_1,\, x_2,\, \ldots\, x_n)} $$

where

$$ \phi(x_1, x_2, \ldots x_n) = \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} + \ldots + \frac{x_n^2}{\sigma_n^2} \qquad (12.21) $$

Now consider the variates $x_1, x_{2.1}, x_{3.12}, \ldots x_{n.12 \ldots (n-1)}$. Whether
$x_1, x_2, \ldots x_n$ are correlated or not, these variates are uncorrelated, in
virtue of 12.10. Let us further suppose they are independent and normally
distributed. Then their distribution is given by


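$$ y_{12 \ldots n} = y'_{12 \ldots n}\, e^{-\frac{1}{2}\phi(x_1,\, x_{2.1},\, \ldots\, x_{n.12 \ldots (n-1)})} \qquad (12.22) $$

where

$$ \phi(x_1, x_{2.1}, \ldots x_{n.12 \ldots (n-1)}) = \frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} + \ldots + \frac{x_{n.12 \ldots (n-1)}^2}{\sigma_{n.12 \ldots (n-1)}^2} \qquad (12.23) $$

and

$$ y'_{12 \ldots n} = \frac{N}{(2\pi)^{n/2}\, \sigma_1 \sigma_{2.1} \ldots \sigma_{n.12 \ldots (n-1)}} \qquad (12.24) $$

The expression (12.23) may be put in a more convenient form. It may
be shown, but we omit the proof, that

$$ \phi = \frac{x_1^2}{\sigma_{1.23 \ldots n}^2} + \frac{x_2^2}{\sigma_{2.13 \ldots n}^2} + \ldots + \frac{x_n^2}{\sigma_{n.12 \ldots (n-1)}^2} - 2 r_{12.34 \ldots n} \frac{x_1 x_2}{\sigma_{1.23 \ldots n}\, \sigma_{2.13 \ldots n}} - \ldots - 2 r_{(n-1)n.12 \ldots (n-2)} \frac{x_{n-1} x_n}{\sigma_{(n-1).12 \ldots (n-2)n}\, \sigma_{n.12 \ldots (n-1)}} \qquad (12.25) $$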


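which exhibits the form as symmetrical in $x_1, x_2, \ldots x_n$.
Now we showed in 12.13 that

$$ \sigma_{1.23 \ldots n}^2 = \sigma_1^2\, \frac{\omega}{\omega_{11}}, \text{ etc.} $$

In precisely the same way it may be shown that

$$ \frac{r_{12.34 \ldots n}}{\sigma_{1.23 \ldots n}\, \sigma_{2.13 \ldots n}} = -\frac{\omega_{12}}{\sigma_1 \sigma_2\, \omega} $$

$\omega_{12}$ being the minor in $\omega$ of the term in the first row and the second
column.

If we substitute these and analogous values in (12.22), we get:

$$ y_{12 \ldots n} = \frac{N}{(2\pi)^{n/2}\, \sigma_1 \sigma_2 \ldots \sigma_n\, \omega^{1/2}}\, e^{-\frac{1}{2}\phi} $$

where

$$ \phi = \frac{1}{\omega}\left\{ \omega_{11}\frac{x_1^2}{\sigma_1^2} + \omega_{22}\frac{x_2^2}{\sigma_2^2} + \ldots + 2\omega_{12}\frac{x_1 x_2}{\sigma_1 \sigma_2} + \ldots + 2\omega_{(n-1)n}\frac{x_{n-1} x_n}{\sigma_{n-1} \sigma_n} \right\} \qquad (12.26) $$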


This is a form which is very frequently quoted. 
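
As a check on the general form, it may help to specialise it to two variates. The working below is a sketch under one stated assumption: that each $\omega_{pq}$ is taken with its proper sign (i.e. as a signed minor), which is what the plus signs before the product terms in the exponent require.

```latex
% Two variates: \omega is the 2x2 determinant of correlations.
\[
\omega = \begin{vmatrix} 1 & r_{12} \\ r_{12} & 1 \end{vmatrix} = 1 - r_{12}^2,
\qquad \omega_{11} = \omega_{22} = 1,
\qquad \omega_{12} = -r_{12}
\]
\[
\phi = \frac{1}{1 - r_{12}^2}\left\{ \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}
       - 2 r_{12}\,\frac{x_1 x_2}{\sigma_1 \sigma_2} \right\},
\qquad
y_{12} = \frac{N}{2\pi\,\sigma_1 \sigma_2 \sqrt{1 - r_{12}^2}}\; e^{-\frac{1}{2}\phi}
\]
```

This is the familiar bivariate normal surface, so the general expression reduces as it should.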
12.29 From these formulae several important results follow immediately.

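In the first place, for any fixed values $h_2, \ldots h_n$ of $x_2, \ldots x_n$, the
exponent (12.25) becomes:

$$ \frac{x_1^2}{\sigma_{1.23 \ldots n}^2} - 2 r_{12.34 \ldots n} \frac{x_1 h_2}{\sigma_{1.23 \ldots n}\, \sigma_{2.13 \ldots n}} - \ldots - 2 r_{1n.23 \ldots (n-1)} \frac{x_1 h_n}{\sigma_{1.23 \ldots n}\, \sigma_{n.12 \ldots (n-1)}} + \text{constant terms} $$

$$ = \left\{ \frac{x_1}{\sigma_{1.23 \ldots n}} - r_{12.34 \ldots n} \frac{h_2}{\sigma_{2.13 \ldots n}} - \ldots - r_{1n.23 \ldots (n-1)} \frac{h_n}{\sigma_{n.12 \ldots (n-1)}} \right\}^2 + \text{constant terms.} $$

Hence $x_1$ is distributed normally about the mean, $m_1$, given by

$$ m_1 = r_{12.34 \ldots n} \frac{\sigma_{1.23 \ldots n}}{\sigma_{2.13 \ldots n}}\, h_2 + \ldots + r_{1n.23 \ldots (n-1)} \frac{\sigma_{1.23 \ldots n}}{\sigma_{n.12 \ldots (n-1)}}\, h_n \qquad (12.27) $$

Hence every array of every order is normally distributed.

It follows in a similar way that any linear function of the x's is
distributed normally.

In particular, all deviations of any order and with any number of
suffixes are normally distributed.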


12.30 Secondly, as will be seen from (12.27), the regression of $x_1$ on
the other variables is linear. It follows that the regression of any variate
on any or all of the others is linear. In (12.27), for instance, the
expressions $r_{12.34 \ldots n}\, \sigma_{1.23 \ldots n} / \sigma_{2.13 \ldots n}$, etc., are the partial
regressions $b_{12.34 \ldots n}$, etc.

12.31 If, in equation (12.23), any fixed values be assigned to $x_{3.12}$ and
all the following deviations, the correlation between $x_1$ and $x_2$, on
expanding $x_{2.1}$, is, as we have seen, normal correlation. Similarly, if any
fixed values be assigned to $x_1$, to $x_{4.123}$, and all the following deviations, on
reducing $x_{3.12}$ to the second order we shall find that the correlation between
$x_{2.1}$ and $x_{3.1}$ is normal correlation, the correlation coefficient being $r_{23.1}$, and
so on. That is to say, using k to denote any group of secondary suffixes, (1)
the correlation between any two deviations $x_{m.k}$ and $x_{n.k}$ is normal correlation;
(2) the correlation between the said deviations is $r_{mn.k}$ whatever the particular
fixed values assigned to the remaining deviations. The latter conclusion, it
will be seen, renders the meaning of partial correlation coefficients much
more definite in the case of normal correlation than in the general case. In
the general case $r_{mn.k}$ represents merely the average correlation, so to speak,
between $x_{m.k}$ and $x_{n.k}$: in the normal case $r_{mn.k}$ is constant for all the
sub-groups corresponding to particular assigned values of the other variables.
Thus in the case of three variables which are normally correlated, if we
assign any given value to $x_3$, the correlation between the associated values
of $x_1$ and $x_2$ is $r_{12.3}$: in the general case $r_{12.3}$, if actually worked out for the
various sub-groups corresponding, say, to increasing values of $x_3$, would
probably exhibit some continuous change, increasing or decreasing as the
case might be.

12.32 It will be noticed that all the preceding work in this chapter
assumes the correlations to have been determined by the product-sum 
formula. The method has also been applied to correlations obtained in 
other ways, e.g. from four-fold or contingency tables. In spite of the 
favourable results of an experimental test (Newbold, Biometrika, 1925, 17, 
251) this procedure remains of doubtful value. 

12.33 It has been shown, however, that for the rank correlation coefficient
τ a meaning can be assigned to partial coefficients calculated by a formula
analogous to (12.13) for three variables, e.g., for three rankings
we have:

$$ \tau_{12.3} = \frac{\tau_{12} - \tau_{13}\, \tau_{23}}{\sqrt{(1 - \tau_{13}^2)(1 - \tau_{23}^2)}} $$

expressing the relationship between rankings 1 and 2 if the influence of
ranking 3 is eliminated. No similar results are known for Spearman's ρ.
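
The partial rank coefficient is simple to compute once the three ordinary coefficients are known. The sketch below is an illustration only: the three rankings are hypothetical, invented by us for the purpose, the function names are our own, and τ is computed by counting concordant and discordant pairs on the assumption that there are no ties.

```python
from itertools import combinations
from math import sqrt

def kendall_tau(a, b):
    """Rank correlation tau for two rankings of the same items (no ties)."""
    score = 0
    for i, j in combinations(range(len(a)), 2):
        agreement = (a[i] - a[j]) * (b[i] - b[j])
        score += 1 if agreement > 0 else -1
    n = len(a)
    return score / (n * (n - 1) / 2)

def partial_tau(t12, t13, t23):
    """tau_{12.3}, by the formula analogous to the partial correlation."""
    return (t12 - t13 * t23) / sqrt((1 - t13 ** 2) * (1 - t23 ** 2))

# Hypothetical rankings of five individuals by three judges.
rank1 = [1, 2, 3, 4, 5]
rank2 = [2, 1, 4, 3, 5]
rank3 = [1, 3, 2, 5, 4]

t12 = kendall_tau(rank1, rank2)
t13 = kendall_tau(rank1, rank3)
t23 = kendall_tau(rank2, rank3)
t12_3 = partial_tau(t12, t13, t23)
```

In this invented case rankings 1 and 2 remain about as closely related after ranking 3 is eliminated as before, because ranking 3 agrees only weakly with ranking 2.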



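SUMMARY

1. The regression equation of $x_1$ on $x_2, x_3, \ldots x_n$ is written:

$$ x_1 = b_{12.34 \ldots n}\,x_2 + b_{13.24 \ldots n}\,x_3 + \ldots + b_{1n.23 \ldots (n-1)}\,x_n $$

The deviation $x_{1.23 \ldots n}$ is defined as

$$ x_1 - b_{12.34 \ldots n}\,x_2 - b_{13.24 \ldots n}\,x_3 - \ldots - b_{1n.23 \ldots (n-1)}\,x_n $$

and $\sigma_{1.23 \ldots n}$ is the standard deviation of $x_{1.23 \ldots n}$.

2. The equations giving the regression coefficients are:

$$ S(x_2\, x_{1.23 \ldots n}) = 0 $$
$$ S(x_3\, x_{1.23 \ldots n}) = 0 $$
$$ \ldots $$
$$ S(x_n\, x_{1.23 \ldots n}) = 0 $$

with similar equations for $x_{2.13 \ldots n}$, etc.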


3. The product-sum of any two deviations is unaltered by omitting any or 
all of the secondary subscripts of the first, if, and only if, all the secondary 
subscripts of the first occur among the secondary subscripts of the second ; 
conversely, the product-sum of any deviation of order p with a deviation 
of order p+q, the p subscripts being the same in each case, is unaltered by 
adding to the secondary subscripts of the former any or all of the q
additional subscripts of the latter. 




'1 t4 




« n 


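4.

5. Any standard deviation of order p can be expressed in terms of a
standard deviation of order p-1 and a correlation of order p-1. In fact,

$$ \sigma_{1.23 \ldots n}^2 = \sigma_{1.23 \ldots (n-1)}^2\, (1 - r_{1n.23 \ldots (n-1)}^2) $$

6.

$$ \sigma_{p.12 \ldots n}^2 = \sigma_p^2\, \frac{\omega}{\omega_{pp}} $$

where $\omega$ is the determinant

$$ \begin{vmatrix} 1 & r_{12} & r_{13} & \ldots & r_{1n} \\ r_{21} & 1 & r_{23} & \ldots & r_{2n} \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ r_{n1} & r_{n2} & r_{n3} & \ldots & 1 \end{vmatrix} $$

and $\omega_{pq}$ is the minor of the element in the pth row and the qth column.

7. Any regression of order p may be expressed in terms of regressions
of order p+1. In fact,

$$ b_{12.34 \ldots (n-1)} = \frac{b_{12.34 \ldots n} + b_{1n.23 \ldots (n-1)}\, b_{n2.13 \ldots (n-1)}}{1 - b_{1n.23 \ldots (n-1)}\, b_{n1.23 \ldots (n-1)}} $$

8. Similarly, for correlations:

$$ r_{12.34 \ldots (n-1)} = \frac{r_{12.34 \ldots n} + r_{1n.23 \ldots (n-1)}\, r_{2n.13 \ldots (n-1)}}{(1 - r_{1n.23 \ldots (n-1)}^2)^{1/2}\, (1 - r_{2n.13 \ldots (n-1)}^2)^{1/2}} $$

9. The coefficient of multiple correlation $R_{1(23 \ldots n)}$ is given by

$$ \sigma_{1.23 \ldots n}^2 = \sigma_1^2\, (1 - R_{1(23 \ldots n)}^2) $$

or

$$ \frac{\omega}{\omega_{11}} = 1 - R_{1(23 \ldots n)}^2 $$

Also,

$$ 1 - R_{1(23 \ldots n)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \ldots (1 - r_{1n.23 \ldots (n-1)}^2) $$

10. R is necessarily not less than zero. If it is zero, the variate to
which it refers is completely uncorrelated with the other variates. If
R = 1, there is a linear relation between the variates.

11. The multivariate normal surface may be written:

$$ y_{12 \ldots n} = \frac{N}{(2\pi)^{n/2}\, \sigma_1 \sigma_2 \ldots \sigma_n \sqrt{\omega}}\, e^{-\frac{1}{2}\phi} $$

where

$$ \phi = \frac{1}{\omega}\left\{ \omega_{11}\frac{x_1^2}{\sigma_1^2} + \omega_{22}\frac{x_2^2}{\sigma_2^2} + \ldots + 2\omega_{12}\frac{x_1 x_2}{\sigma_1 \sigma_2} + \ldots + 2\omega_{(n-1)n}\frac{x_{n-1} x_n}{\sigma_{n-1} \sigma_n} \right\} $$

EXERCISES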

12.1 (Hooker, J. R. Stat. Soc., 1907, 65, 1). The following means, standard
deviations and correlations are found for

$X_1$ = Seed-hay crops in cwts. per acre,

$X_2$ = Spring rainfall in inches,

$X_3$ = Accumulated temperature above 42° F. in spring,

in a certain district of England during twenty years.

$M_1$ = 28.02     $\sigma_1$ = 4.42

$M_2$ = 4.91      $\sigma_2$ = 1.10     $r_{13}$ = -0.40

$M_3$ = 594       $\sigma_3$ = 85       $r_{23}$ = -0.56

Find the partial correlations and the regression equation for hay-croji on 
spring rainfall and accumulated temperature. 

12.2 In Exercise 12.1, find the multiple correlation coefficient of each 
variate on the other two. 

12.3 (The following figures must be taken as an illustration only: the
data on which they were based do not refer to uniform times or areas.)

$X_1$ = Deaths of infants under 1 year per 1,000 births in same year
(infantile mortality).

$X_2$ = Number per thousand of married women occupied for gain.

$X_3$ = Death-rate of persons over 5 years of age per 10,000.

$X_4$ = Number per thousand of population living two or more to a room
(overcrowding).

Taking the figures below for thirty urban areas in England and Wales,
find the partial correlations and the regression equation for infantile
mortality on the other factors.

$M_1$ = 164     $\sigma_1$ = 20.0      $r_{12}$ = +0.49     $r_{23}$ = +0.15

$M_2$ = 158     $\sigma_2$ = 74.9      $r_{13}$ = +0.78     $r_{24}$ = -0.37

$M_3$ = 143     $\sigma_3$ = 22.4      $r_{14}$ = +0.20     $r_{34}$ = +0.23

$M_4$ = 205     $\sigma_4$ = 130.0

12.4 In Exercise 12.3, find the multiple correlation coefficient of $X_1$ on
$X_2$ and $X_3$; and of $X_1$ on the other three variates.

12.5 (Data from W. F. Ogburn, "Factors in the Variation of Crime
among Cities," Jour. Amer. Stat. Assoc., 1935, 30, 12).

For certain large cities in the U.S.A.:

$X_1$ = Crime rate, being the number of known offences per thousand of
population.

$X_2$ = Percentage of male inhabitants.

$X_3$ = Percentage of total inhabitants who are foreign-born males.

$X_4$ = Number of children under 5 years of age per thousand married
women between 15 and 44 years of age.

$X_5$ = Church membership, being number of church members 13 years
of age and over per 100 of total population 13 years of age
and over.

$M_1$ = 19.9      $\sigma_1$ = 7.9      $r_{12}$ = +0.44     $r_{24}$ = -0.19

$M_2$ = 49.2      $\sigma_2$ = 1.3      $r_{13}$ = -0.34     $r_{25}$ = -0.35

$M_3$ = 10.2      $\sigma_3$ = 4.6      $r_{14}$ = -0.31     $r_{34}$ = +0.44

$M_4$ = 481.4     $\sigma_4$ = 74.4     $r_{15}$ = -0.14     $r_{35}$ = +0.33

$M_5$ = 41.6      $\sigma_5$ = 10.8     $r_{23}$ = +0.25     $r_{45}$ = +0.85

Find the regression equation of $X_1$ on the other four variables. Find also

Find, further, $r_{15.3}$, $r_{15.4}$ and $r_{15.34}$. Discuss the influence of church
membership on crime for these data.

12.6 Show that for n variates there are ${}^{n}C_{2}$ total correlation coefficients,
$(n-2)\,{}^{n}C_{2}$ correlation coefficients of order 1, ${}^{n-2}C_{2}\,{}^{n}C_{2}$ correlation coefficients
of order 2, and ${}^{n-2}C_{s}\,{}^{n}C_{2}$ of order s. Hence show that there are $n(n-1)2^{n-3}$
correlation coefficients and $n(n-1)2^{n-2}$ regression coefficients.

12.7 Find the number of multiple correlation coefficients of order s and 
the total number of such coefficients for n variables. 

12.8 If all the correlations of order zero are equal, say = r, what are the
values of the partial correlations of successive orders ? 

Under the same conditions, what is the limiting value of r if all the equal 
correlations are negative and n variables have been observed ? 

12.9 Write down from inspection the values of the partial correlations for
the three variables

$X_1$, $X_2$ and $X_3 = aX_1 + bX_2$

12.10 If the relation

$$ a x_1 + b x_2 + c x_3 = 0 $$

holds for all sets of values of $x_1$, $x_2$ and $x_3$, what must the partial correlations
be?

CORRELATION AND REGRESSION

SOME PRACTICAL PROBLEMS 


13.1 The student should be careful to note that the coefficient of correlation,
like an average or a measure of dispersion, only exhibits in a summary
form one aspect of the facts on which it is based. Some very real difficulties
arise both in the selection of variables for which the coefficient is to be
computed and in the interpretation of the results when obtained. In 
the present chapter we shall consider some of these practical problems 
and indicate how they mould from the outset the scope and nature of 
an inquiry based on correlations and regressions. 

The modifiable unit

13.2 Table 13.1 shows, for each of the 48 agricultural counties of 
England in 1936, the yields per acre of wheat and potatoes. The order 
of arrangement is the one given in the official Agricultural Statistics. 

It is a natural and meaningful question to ask whether there is any
correlation between these yields, so that, for example, we may know
whether an area of high wheat-yield is also one of high potato-yield.

Taking the values of Table 13.1 as they stand we find a correlation of
+0.2189, a value which the student can verify for himself as an exercise.
But we observe that these yields per acre are given for 48 geographical
areas the boundaries of which are quite arbitrary so far as crop yields
are concerned. What would happen if we took other geographical areas?
Should we get the same correlation or not?

We can explore this question to some extent by combining the areas
as given. Suppose we group the counties in pairs and determine for each
of the 24 resulting pairs the simple arithmetic mean yields as exemplified
in the figures following Table 13.1 on the next page.

Since most of the areas are contiguous this is the kind of result
we might get if larger areas than counties were recorded. The yields
per acre so calculated are not necessarily those of the grouped pairs
because the total yields may be greater in one member of the pair than in
the other; but the process will serve for the purposes of illustration.

There are now 24 members and the correlation between the yields will
be found to be +0.2963 against +0.2189 for the original 48. If we
repeat the process and group our 24 pairs (in order as they stand) we find
for the resulting 12 members a correlation of +0.5757. In practice we
should not compute a correlation for a smaller number of values but if
we pursue the condensing process to the bitter end and group our 12
values into 6, we find a correlation of +0.7649; and finally, by grouping
the six into three, we have a correlation of +0.9902.
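
The condensing process just described is mechanical and can be sketched as follows. This is an illustration only, with function names of our own choosing; to keep it short it is run on the first six counties of Table 13.1 rather than on all 48, so the coefficients it prints are not those quoted in the text.

```python
from math import sqrt

def pearson(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pair_means(values):
    """Simple means of consecutive pairs, in the order given.

    An odd value left over at the end is dropped.
    """
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]

# Bedford, Huntingdon, Cambridge, Ely, Suffolk West, Suffolk East.
wheat = [16.0, 16.0, 16.4, 20.5, 18.2, 16.3]      # cwts. per acre
potatoes = [5.3, 6.6, 6.1, 5.5, 6.9, 6.1]         # tons per acre

level_w, level_p = wheat, potatoes
history = []  # (number of units, correlation) at each stage of grouping
while len(level_w) >= 3:
    history.append((len(level_w), pearson(level_w, level_p)))
    level_w, level_p = pair_means(level_w), pair_means(level_p)

for units, r in history:
    print(units, round(r, 4))
```

Applied to the full table, the same loop would give the successive coefficients for 48, 24, 12, 6 and 3 units.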

TABLE 13.1. — ^Yields of wheat and potatoes in .43 counties In England In 1936 


County                    Wheat      Potatoes    County                    Wheat      Potatoes
                          (cwts.     (tons                                 (cwts.     (tons
                          per acre)  per acre)                             per acre)  per acre)

Bedford                   16.0       5.3         Northampton               14.3       4.9
Huntingdon                16.0       6.6         Peterborough              14.4       5.6
Cambridge                 16.4       6.1         Buckingham                15.2       6.4
Ely                       20.5       5.5         Oxford                    14.1       6.9
Suffolk, West             18.2       6.9         Warwick                   15.4       5.6
Suffolk, East             16.3       6.1         Shropshire                16.5       6.1
Essex                     17.7       6.4         Worcester                 14.2       5.7
Hertford                  15.3       6.3         Gloucester                13.2       5.0
Middlesex                 16.5       7.8         Wiltshire                 13.8       6.5
Norfolk                   16.9       8.3         Hereford                  14.4       6.2
Lincoln (Holland)         21.8       5.7         Somerset                  13.4       5.2
Lincoln (Kesteven)        15.5       6.2         Dorset                    11.2       6.6
Lincoln (Lindsey)         15.8       6.0         Devon                     14.4       5.8
Yorkshire (East Riding)   16.1       6.1         Cornwall                  15.4       6.3
Kent                      18.5       6.6         Northumberland            18.5       6.3
Surrey                    12.7       4.8         Durham                    16.4       5.8
Sussex (East)             15.7       4.9         Yorkshire (N.R.)          17.0       5.9
Sussex (West)             14.3       5.1         Yorkshire (W.R.)          16.9       6.5
Berkshire                 13.8       5.5         Cumberland                17.5       5.8
Hampshire                 12.8       6.7         Westmorland               15.8       5.7
Isle of Wight             12.0       6.5         Lancashire                19.2       7.2
Nottingham                15.6       5.2         Cheshire                  17.7       6.5
Leicester                 15.8       5.2         Derby                     15.2       5.4
Rutland                   16.6       7.1         Stafford                  17.1       6.3

                                      Wheat (cwts.)   Potatoes (tons)

Bedfordshire and Huntingdonshire      16.0            5.95
Cambridgeshire and Ely                18.45           5.80
Suffolk West and Suffolk East         17.25           6.5
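
The pairing process can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the computation used for the text: the names pair_means and pearson_r are our own, and only the first six counties of the left-hand column of Table 13.1 are entered, so the coefficients quoted for the full 48 counties are not reproduced here.

```python
from math import sqrt


def pair_means(values):
    """Average successive pairs (1st with 2nd, 3rd with 4th, ...)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Bedford, Huntingdon, Cambridge, Ely, Suffolk West, Suffolk East
wheat = [16.0, 16.0, 16.4, 20.5, 18.2, 16.3]
potatoes = [5.3, 6.6, 6.1, 5.5, 6.9, 6.1]

wheat_pairs = pair_means(wheat)        # approx. 16.0, 18.45, 17.25
potato_pairs = pair_means(potatoes)    # approx. 5.95, 5.80, 6.50
```

Applying pair_means repeatedly to the complete columns, and calling pearson_r on the result at each stage, is the condensing process described above.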

13.3 We have thus found correlations ranging from 0.2189 to 0.9902. 
Nor is this all. We may well expect that if our 48 counties were divided 
into smaller areas the resulting correlation would be smaller than 0.2189. 
On the face of it we seem to be able to produce any value of the correlation 
from 0 to 1 merely by choosing an appropriate size of the unit of area for 
which we measure the yields. Is there, then, any "real" correlation 
between wheat- and potato-yields or are our results illusory ? 

13.4 This example serves to bring out an important distinction between 
two different types of data to which correlation analysis may be applied. 
The difficulty does not arise when we are considering the relationship, 
say, between heights of fathers and sons. The ultimate unit in this case 
is the individual father or son whose height is a unique non-modifiable 
numerical measurement. We cannot divide a single pair of father-and- 
son into smaller units ; nor can we amalgamate two pairs to give measure- 
ments of the same type as that of the single pair. The same is true of 
the data of Table 9.1 (correlation between measurements on shells), of 
Table 9.2 (correlation between ages of husband and wife), and of Table 
9.4 (correlation between age and weekly milk-yield of cows): the shell, 
the married couple and the cow are non-modifiable units. 

13.5 On the other hand, our geographical areas chosen for the calculation 
of crop yields are modifiable units, and necessarily so. Since it is impossible 
(or at any rate agriculturally impracticable) to grow wheat and potatoes 
on the same piece of ground simultaneously we must, to give our investiga- 
tion any meaning, consider an area containing both wheat and potatoes ; 
and this area is modifiable at choice. A similar effect arises whenever 
we try to measure concomitant variation extending over continuous 
regions of space or time. For example, a regional death-rate must 
necessarily relate to a modifiable geographical area ; and rainfall, regional 
prices, production of goods or services are quantities of the same type. 
In the case where observations are taken over time, examples are imports 
and exports, cost of living, and stock-exchange prices. Suppose, for 
instance, that we are interested in a possible relationship over time between 
the marriage-rate and the wholesale price index, the suggestion being that 
in prosperous times, when the price index is relatively high, more people 
can afford to marry. Are we to correlate figures compiled on a monthly 
basis, a quarterly basis, an annual basis or a triennial basis ? The unit 
of time is essentially modifiable. 

13.6 From the example we have given as to crop-yields it will be clear 
that the magnitude of a correlation will, in general, depend on the unit 
chosen if that unit is modifiable. Our correlations will accordingly 
measure the relationship between the variates for the specified units chosen 
for the work. They have no absolute validity independently of those 
units, but are relative to them. They measure, as it were, not only the 
variation of the quantities under consideration, but the properties of the 
unit-mesh which we have imposed on the system in order to measure it. 

13.7 The student should not now go to the other extreme and claim 
that, since a large range of values of correlation coefficients may be 
obtained according to the choice of a modifiable unit, a particular value 
has no significance and that any inquiry based on correlations in the 
modifiable case is useless. It is of some significance to know that the 
correlation between wheat- and potato-yields in the 48 counties of England 
in 1936 was 0.2189. A comparison of a series of such values over a 
period of years might well throw light on changes in farm practice or 
soil fertility ; the correlation and the corresponding regression indicate 
how far we may expect to predict the potato crop from a knowledge of 
the earlier-harvested wheat crop: in this particular case, not very far. 
But we must emphasise the necessity, in this type of work, of not losing 
sight of the fact that our results depend on our units. The point assumes 
particular importance when we are trying to disentangle causal factors. 
It is a fact that wheat- and potato-yields in the 48 counties of England 
were correlated in 1936 ; but it is a geographical as well as an agricultural 
fact. We cannot infer without additional inquiry that soil which produces 
good crops of wheat tends to produce good crops of potatoes. 

The attenuation effect 

13.8 There is a distinct type of grouping-effect in correlation analysis 
which leads to a very similar increase in correlations with increasing 
size of geographical area. Suppose we are interested in the relationship 
between income and size of family in a certain country. Ignoring minor 
difficulties as to what constitutes a family in some cases, we have a non- 
modifiable unit. If time, patience and money were available in sufficient 
quantity we might be able to ascertain the income and family-size for 
each unit in the country ; but in practice (unless we performed an ad hoc 
sampling inquiry) we should probably have regard to totals and averages 
available for regions and districts. We might, for instance, attempt to 
estimate the mean number per family for census districts and estimate 
the mean income from fiscal or local taxation data. Effectively we should 
then be grouping the non-modifiable units into larger units which are 
themselves, within limits, modifiable. 


13.9 Suppose we have two variables x, y each of which can be regarded 
as the sum of a systematic and a random element 

$$x = \xi + e, \qquad y = \eta + f \tag{13.1}$$

We may, for example, imagine that there is some causal factor affecting 
ξ and η simultaneously and hence resulting in a correlation between x 
and y ; but that other components e and f are unrelated to ξ and η 
and to each other. 

Without loss of generality we may suppose that ξ and e are measured 
about their means, in which case x will also be measured about its mean. 
We then have 

$$\Sigma(x^2) = \Sigma(\xi^2) + \Sigma(e^2) + 2\Sigma(\xi e)$$

and since ξ and e are uncorrelated we have, on dividing by the number 
of the population 

$$\operatorname{var} x = \operatorname{var} \xi + \operatorname{var} e \tag{13.2}$$

where we write var x for the variance of x. Equation (13.2) is a particular 
case of a theorem which we shall consider in more detail in the next chapter 
(14.2). 

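Similarly we shall have 

$$\operatorname{var} y = \operatorname{var} \eta + \operatorname{var} f \tag{13.3}$$

and, writing cov (x, y) for the covariance of x and y 

$$\operatorname{cov}(x, y) = \operatorname{cov}(\xi, \eta) \tag{13.4}$$

Let us now denote the correlation between x and y by r and that between 
ξ and η by r'. We then have 

$$
\begin{aligned}
r &= \frac{\operatorname{cov}(x, y)}{\{\operatorname{var} x \, \operatorname{var} y\}^{1/2}} \\
  &= \frac{\operatorname{cov}(\xi, \eta)}{\{(\operatorname{var} \xi + \operatorname{var} e)(\operatorname{var} \eta + \operatorname{var} f)\}^{1/2}} \\
  &= \frac{\operatorname{cov}(\xi, \eta)}{\{\operatorname{var} \xi \, \operatorname{var} \eta\}^{1/2}} \cdot \frac{1}{\left\{\left(1 + \dfrac{\operatorname{var} e}{\operatorname{var} \xi}\right)\left(1 + \dfrac{\operatorname{var} f}{\operatorname{var} \eta}\right)\right\}^{1/2}} \\
  &= \frac{r'}{\left\{\left(1 + \dfrac{\operatorname{var} e}{\operatorname{var} \xi}\right)\left(1 + \dfrac{\operatorname{var} f}{\operatorname{var} \eta}\right)\right\}^{1/2}}
\end{aligned}
\tag{13.5}
$$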


Now a variance is essentially non-negative and hence each part of the 
denominator on the right hand side of (13.5) is greater than unity. Con- 
sequently r is less than r' ; that is to say, a correlation calculated from 
the observed values is reduced, or we may say attenuated by the effect 
of the factors expressed by e and f. 
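
The final form of (13.5), r = r' / {(1 + var e/var ξ)(1 + var f/var η)}^(1/2), can be put into a short Python function. This is a sketch only: the function name attenuated_r and the numerical values are hypothetical, chosen to make the arithmetic easy to follow.

```python
from math import sqrt


def attenuated_r(r_prime, ratio_x, ratio_y):
    """Equation (13.5): r = r' / sqrt((1 + ratio_x) * (1 + ratio_y)).

    ratio_x stands for var e / var xi and ratio_y for var f / var eta;
    both are ratios of variances and so cannot be negative.
    """
    if ratio_x < 0 or ratio_y < 0:
        raise ValueError("variance ratios cannot be negative")
    return r_prime / sqrt((1 + ratio_x) * (1 + ratio_y))


# Hypothetical case: the random variance equals the systematic variance
# in both variables, so each factor in the denominator is 2.
r = attenuated_r(0.8, 1.0, 1.0)   # 0.8 / sqrt(2 * 2) = 0.4
```

With no random element (both ratios zero) the function returns r' unchanged, which is the unattenuated case.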


13.10 Now suppose that we group units, bearing x and y values, either 
geographically or in time. In virtue of a sampling effect which we shall 
study later (Chapter 17) the proportionate variance var e/var ξ will be 
reduced. For the present we assume this ; but the reader will probably 
accept it as plausible from the consideration that systematic effects 
represented by ξ and η will be cumulative, whereas random effects 
represented by e and f tend to cancel out: the larger the number of units 
we group, the less, relatively speaking, will their total be affected by 
erratic fluctuations. 

It follows that the denominator in (13.5) will also be reduced as we 
increase the size of the grouping ; and consequently, if r' is constant r 
will continually increase as we group more and more individuals. 
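
The argument can be followed numerically. In the sketch below we ASSUME, purely for illustration, that grouping m units divides each proportionate variance by m; the text asserts only that the ratio is reduced, so the 1/m law, the function name grouped_r and the starting values are all our own assumptions.

```python
from math import sqrt


def grouped_r(r_prime, ratio_x, ratio_y, m):
    """Correlation after grouping m units, ASSUMING each proportionate
    variance (var e / var xi and var f / var eta) is divided by m."""
    if m < 1:
        raise ValueError("group size must be at least 1")
    return r_prime / sqrt((1 + ratio_x / m) * (1 + ratio_y / m))


# Hypothetical values: r' = 0.8, random variance three times the systematic.
sizes = [1, 2, 4, 8, 16]
rs = [grouped_r(0.8, 3.0, 3.0, m) for m in sizes]

for m, value in zip(sizes, rs):
    print(m, round(value, 3))
```

Under this assumption r rises steadily with the size of the group and approaches, but never exceeds, the constant r'.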

13.11 This is the kind of effect we frequently find. It is not necessarily 
due to the system which we have just discussed, though that system 
provides a possible explanation. There may be other effects such as 
"patchiness" in the total area under consideration, which would lead 
to r' itself changing with increased grouping and might either enhance 
or counteract the effect of grouping on random components. What 
explanation we seek in individual cases depends on the individual cir- 
cumstances. We can only leave the reader with the warning to watch 
very carefully the possibility of grouping effects, particularly in economic 
investigations. 

Example 13.1 (Gehlke and Biehl, J. Am. Stat. Ass. Supp., 1934, 29, 169) 
A study was made of the relationship between male juvenile delinquency, 
expressed as absolute numbers, and the median monthly rental in Cleveland, 
Ohio. The 252 census tracts were grouped successively into 200, 175, 
150, 125, 100, 50 and 25 areas, consisting so far as possible of the same 
size and comprising contiguous territory. 

The correlation coefficients, including that for the original 252 tracts, ran 
-0.502, -0.569, -0.580, -0.606, -0.662, -0.667, -0.685, -0.763. 
The characteristic increase of correlation with size of area is clear. The 
corresponding correlations between rates of male juvenile delinquency 
and median monthly rentals were -0.516, -0.504, -0.480, -0.475, 
-0.563, -0.524, -0.579, -0.621. Here the increase is not uniform 
but it begins to appear as the grouping becomes more condensed. 

TABLE 13.2. Numbers of wireless receiving licences issued during the year in the 
U.K. and numbers of notified mental defectives in England and Wales 

(Data from Statistical Abstract for the United Kingdom, Cmd. 5903, 1939) 


Year    Number of wireless     Number of notified
        receiving licences     mental defectives per
        issued (thousands)     10,000 of estimated
                               population

1924    1,350                   8
1925    1,960                   8
1926    2,270                   9
1927    2,483                  10
1928    2,730                  11
1929    3,091                  11
1930    3,647                  12
1931    4,620                  16
1932    5,497                  18
1933    6,260                  19
1934    7,012                  20
1935    7,618                  21
1936    8,131                  22
1937    8,593                  23


Note : The year for the purposes of the wireless licence records is 
the fiscal year April/March ; for the mental defective records 
the census date is January 1st. 

Nonsense correlations 

13.12 In Table 13.2 we show the number of wireless receiving licences 
taken out from 1924 to 1937 in the United Kingdom and the number of 
notified mental defectives per 10,000 in England and Wales for the same 
period. A glance at these figures shows that they are very highly 
correlated. The correlation coefficient is, in fact, 0.998. 
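
The reader with a computer to hand may check the order of magnitude from the figures of Table 13.2. The sketch below is ours, not the authors' working; pearson_r is a hypothetical helper name, and since the defective rates are printed only to the nearest whole number the result need not agree to the third decimal place with the coefficient quoted above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Table 13.2, years 1924 to 1937 in order.
licences = [1350, 1960, 2270, 2483, 2730, 3091, 3647,
            4620, 5497, 6260, 7012, 7618, 8131, 8593]     # thousands
defectives = [8, 8, 9, 10, 11, 11, 12,
              16, 18, 19, 20, 21, 22, 23]                 # per 10,000

r = pearson_r(licences, defectives)
print(round(r, 3))   # very close to unity
```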

Now, facetiousness apart, it cannot be contended that listening to the 
radio conduces to notifiable mental defect or vice-versa. The correlation 
appears to be nonsensical. Before dismissing it as such, however, we 
must concede that the possibility of causal connection cannot be entirely 
excluded. For instance, it might be argued that the period in question 
was one of great technical progress in many scientific fields ; that one 
effect of this movement was the development of broadcasting and the 
general spread of the practice of listening evinced by the increased number 
of licences taken out ; that another effect was the greater interest in 
psychological ailments and increased facilities for treatment, resulting 
in either more discoveries of mental defect or greater readiness to submit 
cases to medical notice. Whether this is the right explanation is doubtful, 
but it is a possible rational explanation of what at first sight seems absurd. 

13.13 The more reasonable explanation is that the strength of the 
correlation is an accident ; and our point will have been made if the 
reader understands what sort of an accident it is. When we consider 
sampling in Chapter 16 et seq, we shall discuss the nature of sampling 
distributions and shall point out that occasionally, by sheer chance, an 
improbable event may arise. In sampling from a bivariate normal 
population, for instance, as we have pointed out above (9.28) a high 
correlation may appear even when the parent is uncorrelated, albeit 
rather rarely. This, however, arises in sampling where members are 
chosen independently. In the case of our nonsense-correlation we have 
taken a sequence of values moving through time, each very dependent on 
the one before. Our present effect, accordingly, is not a sampling fluctuation 
as ordinarily understood. 

13.14 It may, none the less, be regarded as accidental. Suppose we 
have two series in time, each of which is moving fairly steadily upwards or 
downwards (i.e. increasing or decreasing more or less uniformly from 
one year to the next). Clearly such series will appear as highly correlated, 
positively or negatively, if we happen to choose for consideration periods 
of time in which the movement of each series is in the same direction. 
But the reasons for the movements may be quite unrelated or at least 
so remote that we cannot claim any "real" connection between the two 
series. Increased numbers of radio licences are due to the invention of 
radio communication and the steady movement towards the saturation of 
a latent demand. This is probably quite unrelated to the development 
in notifications of mental defectives. It may well be that in a future 
period the numbers of licences may decline with a declining population 
while the numbers of notified defectives increase. 
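
The point may be illustrated with artificial figures. In the sketch below two series are built from independent random disturbances, each added to its own steady upward drift; the drifts, the seed, the length of thirty "years" and all names are hypothetical choices of ours, not data from any source.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)      # fixed seed so that the sketch is repeatable
n = 30                      # thirty hypothetical "years"

# Independent disturbances: nothing connects one series with the other.
noise_a = [rng.gauss(0, 1) for _ in range(n)]
noise_b = [rng.gauss(0, 1) for _ in range(n)]

# Each series moves steadily upwards for reasons of its own.
series_a = [1.0 * t + noise_a[t] for t in range(n)]
series_b = [2.0 * t + noise_b[t] for t in range(n)]

r_series = pearson_r(series_a, series_b)      # dominated by the two drifts
r_noise = pearson_r(noise_a, noise_b)         # the disturbances alone
print(round(r_series, 3), round(r_noise, 3))
```

The high coefficient between the two series reflects nothing but the fact that both happen to be rising over the period chosen.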

13.15 It is possible to have nonsense-correlations in space as well as in 
time, though good examples are hard to find. As we move from north 
to south across Europe, for example, the proportion of Roman Catholics 
in the population probably increases — there are few in Scotland and a 
great many in Sicily. At the same time we should probably find a decrease 
in the average height. If, therefore, we were to correlate height and 
proportion of Catholics (we have not tried the experiment) we should 
probably find quite a substantial negative correlation ; but if so it would 
be obvious nonsense in our present usage of the word. 

Variate-differences 

13.16 Figure 13.1 shows, for the period 1838-1914, the movements of (a) 
the infantile mortality (deaths of infants under one year of age per 1,000 
births in the same year) and (b) the general mortality (deaths at all ages 
per 1,000 living) in England and Wales. A very cursory inspection of 
the diagram shows that the two varied together — when the infantile 
mortality rose from one year to the next the general mortality did the 
same, with only seven or eight exceptions to the rule during the whole 
period under review. The correlation between the annual values of the 
two may be expected to be positive, because the infantile death-rate 
forms part of the general death-rate ; but it would not be very high 
as the general mortality fell more or less steadily from 1875 onwards 
whereas the infantile mortality rose to a peak in 1898. During a long 
period of time the correlation may nearly vanish, for the two mortalities 
are affected by largely different causes. In this sense, a high correlation 
for a short period might be "nonsense" (though this is stretching our 
usage rather far) if it was interpreted as implying a strong causal nexus 
in the long run. 

13.17 To exhibit the closeness of the relation between infantile and 
general mortality for such causes as show marked changes from one year 
to the next it will be best to proceed by correlating the annual changes 
and not the annual values. The work would be arranged in the following 
form (only sufficient years being given to exhibit the principle of the 
process), and the correlation worked out between the figures of columns 
3 and 5 — 


  1         2                3                4                5
Year    Infantile        Increase or      General          Increase or
        mortality per    decrease from    mortality per    decrease from
        1,000 births     year before      1,000 living     year before

1838    159                               22.4
1839    151              -8               21.8             -0.6
1840    154              +3               22.9             +1.1
1841    145              -9               21.6             -1.3
1842    152              +7               21.7             +0.1
1843    150              -2               21.2             -0.5






Fig. 13.1. Infantile and general mortality in England and Wales, 1838-1914 
(horizontal axis: years) 

For the period to which the diagram refers, viz. 1838-1914, the following 
constants were found by this method: 

Infantile mortality, mean annual change         -0.71 
Infantile mortality, standard deviation         10.76 
General mortality, mean annual change           -0.11 
General mortality, standard deviation            1.13 
Coefficient of correlation                      +0.69 


This is a much higher correlation than would arise from the mere fact 
that the deaths of infants form part of the general mortality, and con- 
sequently there must be a high correlation between the annual changes in 
the mortality of those who are over and under 1 year of age, respectively. 
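
The arrangement of columns 3 and 5 may be sketched in Python as follows, using only the six years shown in the specimen table. The helper names are ours, and the coefficient obtained from so few years merely illustrates the procedure; it is not the value found for the whole period 1838-1914.

```python
from math import sqrt


def first_differences(series):
    """Change from each year to the next (one value fewer than the series)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Columns 2 and 4 of the specimen table, 1838 to 1843.
infantile = [159, 151, 154, 145, 152, 150]
general = [22.4, 21.8, 22.9, 21.6, 21.7, 21.2]

d_infantile = first_differences(infantile)                      # column 3
d_general = [round(d, 1) for d in first_differences(general)]   # column 5

r_changes = pearson_r(d_infantile, d_general)
```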

13.18 The procedure of the foregoing section has been called the 
"variate-difference correlation method." By taking first differences instead of 
the variate values themselves, the slower changes of the two variates 
with time are to some extent eliminated, and we are able to study the 
effect of short-term variations. To eliminate the secular changes more 
completely it may be desirable to proceed to second differences, i.e. to work 
out the successive differences of the differences in column 3 and column 5 
before correlating. It may even be desirable to proceed to third, fourth 
or higher differences before correlating. The method should, however, be 
used with caution in such cases, particularly with short series. Correlation 
coefficients obtained from higher differences are not always reliable, and 
their interpretation becomes a matter of considerable difficulty. We 
return to the subject later in Chapters 26 and 27 on time-series, where will 
also be found a method more adapted to the case of time-series in which 
wave-like oscillations appear to be imposed on the general trend. 
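
Second and higher differences are obtained by repeating the differencing. A minimal sketch, in which the function name differences and its order argument are our own, applied to the six infantile mortality figures of the specimen table in 13.17:

```python
def differences(series, order=1):
    """Difference a series `order` times; each pass shortens it by one value."""
    if order < 0:
        raise ValueError("order cannot be negative")
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


infantile = [159, 151, 154, 145, 152, 150]    # 1838 to 1843

first = differences(infantile, order=1)       # [-8, 3, -9, 7, -2]
second = differences(infantile, order=2)      # [11, -12, 16, -9]
third = differences(infantile, order=3)       # only three values remain
```

Each further order of differencing removes one observation, which is one reason for caution with short series.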

13.19 When an inquiry involving correlation or regression analysis is 
undertaken the variables to be considered are sometimes determined at 
the outset by the nature of the questions which are to be answered. If, 
for example, we are asked to investigate the relationship between the 
annual suicide rate and the annual number of bankruptcies in a particular 
country our variables are specified and all that remains is to obtain the 
data and to work on them. There may, indeed, be practical difficulties 
in obtaining the data for the right years or the right areas but this is not 
a matter in which theoretical considerations can help us. 

13.20 More usually, the type of inquiry we are asked to undertake is 
less definitely specified. We may wish to investigate the relationship 
between a number of quantities or factors which are not directly measurable, 
e.g. the relation between weather and the prevalence of epidemic 
disease. There is no single measurement corresponding to "weather" 
and we have to select a number of variables to represent it such as temperature, 
rainfall, or cloudiness. Each of these, in general, may be modifiable 
or non-modifiable and we have an additional element of choice in the 
precise form of the variate which we select. 

13.21 In the extreme case we may not even know which factors will 
emerge from our analysis as important. Suppose we are interested in the 
factors which encourage or prevent tuberculosis and attempt to throw 
some light on the subject by considering variations in the incidence of 
the disease in different areas. What factors are we to select as in- 
dependent ? It is easy to write down a long list of possible factors — 
income, overcrowding, rainfall, sunshine, height above sea-level and so 
forth. Assuming for the moment that we can measure all these factors, 
how far do we have to take them into account, and can we do so without 
rendering the analysis quite unwieldy ? 

There is no simple answer to these questions. In the remainder of the 
chapter we shall give a short account of some of the resources at the 
investigator's disposal in particular cases. 

A practical example 

13.22 Some of the questions which arise are illustrated in an investigation 
by Hooker (J. R. Stat. Soc., 1907, 65, 1) into the relationship between the 
yield of certain crops (cereals, roots and hay) and the weather. 

The material question here was how far crop-yields in the same area 
vary with the weather. Geographical variation was therefore not in 
point, and Hooker considered the series of values over a period of years 
for a single area. Climatic, soil, and farm-practice conditions vary so 
much over the United Kingdom that any attempt to take geographical 
variation into account would have complicated the analysis enormously. 
By choosing one area we eliminate some of the variables and can con- 
centrate on climatic factors. Our gain in simplicity may, of course, be 
offset by loss of generality — we cannot assume that our results will hold 
good for other areas where different conditions exist. We must also be 
careful to ascertain that, even in the area under consideration, our series 
of years is not so long that there are material changes which would obscure 
climatic effects, such as exhaustion of soil fertility or a switch from arable 
to grass farming. 

13.23 There then arises the problem of selecting the appropriate area. 
The desiderata are (1) that it should be reasonably homogeneous from the 
meteorological standpoint and (2) it should be large enough to present 
a representative variety of soil. Hooker chose a group of eastern counties, 
consisting of Lincoln, Huntingdon, Cambridge, Norfolk, Suffolk, Essex, 
Bedford and Hertford, as fulfilling these conditions. The group included 
the county with the largest acreage of each of the ten crops investigated 
with the single exception of permanent grass. 

13.24 Produce statistics for the more important crops of England and 
Wales have been issued by the Ministry of Agriculture since 1885. The 
figures are based on estimates of yield furnished by local official estimators 
all over the country. Estimates are published for separate counties and 
for groups of counties (divisions), but not for smaller units of area, though 
the crop estimators usually submit returns for parishes. 

The data in this case are thus provided by the official publications. 
Their nature limits the inquiry in space (since we must choose areas based 
on counties) and in time (since figures are not available prior to 1885). 
We must also assume that the estimates are reasonably accurate. The 
field of choice in most economic inquiries is limited by such factors as 
these. 

13.25 Having decided on our crop-figures we have to consider the weather 
factors. The produce of a crop is dependent on the weather of a long 
preceding period, and it is naturally desired to find the influence of the 
weather at successive stages during this period, and to determine, for 
each crop, which period of the year is of most critical importance as regards 
weather. It must be remembered, however, that the times of both sowing 
and harvest are themselves very largely dependent on the weather, and 
consequently, on an average of many years, the limits of the critical period 
will not be very well defined. If, therefore, we correlate the produce of the 
crop (X) with the characteristics of the weather (Y) during successive 
intervals of the year, it will be as well not to make these intervals too short. 
It was accordingly decided to take successive groups of 8 weeks, overlapping 
each other by 4 weeks, i.e. weeks 1-8, 5-12, etc. Correlation coefficients 
were thus obtained at 4-week intervals, but based on 8 weeks' weather. 
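
The scheme of overlapping periods can be written down mechanically. In the sketch below the function name and the choice of a 52-week span are our assumptions; the text does not state over how many weeks the periods were actually carried.

```python
def overlapping_windows(total_weeks, length=8, step=4):
    """Return (first_week, last_week) pairs, weeks being numbered from 1.

    Successive windows of `length` weeks start `step` weeks apart, so with
    the defaults each window overlaps its predecessor by 4 weeks.
    """
    return [(start, start + length - 1)
            for start in range(1, total_weeks - length + 2, step)]


windows = overlapping_windows(52)   # (1, 8), (5, 12), ..., (45, 52)
```

One correlation coefficient would then be worked out for each window, between the crop yield and the weather figure for those eight weeks.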

13.26 Finally, we have to decide what measurable characteristics of the 
weather are to be taken into account. Prior knowledge suggests that 
the two most important are rainfall and temperature. The two provide 
quite enough labour for a first investigation. 

(a) The rainfall for a particular county is to some extent a modifiable 
unit, for no measurements are taken of the total precipitation on a given 
area. Hooker took records of weekly rainfall from eight stations within 
the total area under consideration and used the average of these figures 
as the first characteristic of the weather. 

(b) Temperatures were taken from the records of the same stations. 
The average temperatures, however, do not give quite the sort of information 
that is required : at temperatures below a certain limit (about 42° 
Fahr.) there is very little growth, and the growth increases in rapidity 
as the temperature rises above this point (within limits). It was therefore 
decided to utilise the figures for "accumulated temperatures above 42° 
Fahr.," i.e. the total number of day-degrees above 42° during each of the 
8-weekly periods, as the second characteristic of the weather ; these 
"accumulated temperatures," moreover, show much larger variations than 
mean temperatures. 
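
The notion of day-degrees may be made concrete as follows. The sketch ASSUMES that the accumulation is formed from daily mean temperatures, each day contributing its excess over 42° Fahr. and days at or below the base contributing nothing; the official method of compilation may differ in detail, and the week of temperatures shown is invented for the purpose.

```python
def accumulated_temperature(daily_means, base=42.0):
    """Day-degrees above `base` (degrees Fahr.).

    ASSUMPTION: each day contributes max(mean - base, 0); days at or
    below the base add nothing to the total.
    """
    return sum(max(t - base, 0.0) for t in daily_means)


# Hypothetical daily mean temperatures for one week (degrees Fahr.).
week = [40.0, 41.5, 43.0, 45.0, 47.5, 42.0, 39.0]

total = accumulated_temperature(week)   # 1.0 + 3.0 + 5.5 = 9.5 day-degrees
```

The mean of these seven temperatures is only a little above 42°, yet the accumulated figure distinguishes clearly between a warm spell and a cold one, which is why it varies more than the mean.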

Reference should be made to Hooker’s paper for a more detailed account 
of the inquiry and its results. 

Economy in the number of variables 

13.27 In the agricultural case we have just considered there was a 
large body of prior knowledge available to assist in determining the field 
of inquiry and the variables which were likely to give significant and 
meaningful results. This is not always the case. In discussing the 
geographical variation of mortality our prior knowledge would suggest 
considering as independent variates such factors as age-distribution, 
proportion of males and density of population. We could, however, 
without difficulty extend the list of possible factors almost indefinitely, 
e.g. by including hours of sunshine, wage levels, adequacy of medical 
attention and standards of nutrition. In an investigation into the 
variation of crime among American cities Ogburn (J. Am. Stat. Ass., 1935, 
30, 12) listed no fewer than 26 factors including birth-rate, proportion of 
negroes and proportion of foreign-born immigrants, as well as the more 
obvious ones such as efficacy of the police system and proportion of males. 

13.28 With adequate data and sufficient patience, of course, we can 
work out the regression of our variable on all these others. But the 
practical difficulties, including those of computation, are prohibitive ; 
and sometimes there are theoretical difficulties into the bargain. The 
reader who consults some earlier inquiries in which arithmetical en- 
thusiasm was not tempered by common sense will find that there are 
more variables than observations and that the resulting high correlations 
may mean next to nothing. In any case, ten variables are about as many 
as can be conveniently managed, and even that number throws a severe 
strain on the computer. 

13.29 It is therefore necessary at an early stage to economise in the 
number of variables — 

(a) As in the agricultural example we may limit the scope of the inquiry. 
This is what the physicist does in the laboratory by holding other factors 
as constant as experimental conditions will allow. By taking a particular 
factor as constant (within reasonable limits) we may ignore its effect on 
the regression equation. Subject to practical limitations we exclude in 
this way those factors which are expected to have the least effect. We 
can always bring them into account later one by one if necessary. 

(b) Certain of the variables may be grouped and expressed, at least 
approximately, in terms of one of them or of some other summarising 
coefficient. In considering the relationship between employment and 
retail prices, for instance, we need not bring into account as a separate 
variate every retail commodity entering into the household budget. An 
index of retail prices would probably be quite sufficient. Again, in a 
mortality inquiry we might suppose that ability to pay for medical 
attention and standards of nutrition were sufficiently closely linked to 
wage-levels to justify us in using wage-levels to represent capacity to 
pay the doctor's bills and to buy enough food. 

(c) As we have already mentioned, we may proceed by selecting two or 
three of the most promising variables to see whether the regression line 
containing them satisfactorily accounts for the data (as judged, for 
example, by the magnitude of the multiple correlation coefficient). If 
it does not we may add further variates until a good fit is obtained. 

13.30 To conclude this chapter we may refer to some approaches to the 
problem of statistical relationship which have been developed for particular 
purposes but are capable of more general application. 

A regression equation expresses the "best" linear relationship between 
a dependent variable and a set of given independent variables, "best" 
in this connection being somewhat arbitrarily defined by minimising a 
certain sum of squares. Let us look at this geometrically. Given a set 
of points in n dimensions where n is the total number of variables, dependent 
and independent together, we find as the regression of one on the 
others that plane which lies closest to the points ; "closest" being defined 
so as to minimise the sum of squares of distances from the points to the 
plane in the direction parallel to the axis of the dependent variate. The 
student can picture this situation easily enough in the two- and three- 
dimensional case ; and further dimensions, though impossible to imagine 
spatially, add nothing new to the principles. 
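
In the two-dimensional case the "plane" is a line, and the definition of "closest" can be checked directly. The sketch below uses the ordinary least-squares formulae for a line y = a + bx; the data and the function names are hypothetical.

```python
def regression_line(xs, ys):
    """Least-squares line y = a + b*x, minimising the sum of squared
    distances measured parallel to the y-axis."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b


def sum_sq_vertical(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_line(xs, ys)              # a is about 0.05, b about 1.99
best = sum_sq_vertical(xs, ys, a, b)
shifted = sum_sq_vertical(xs, ys, a + 0.1, b)
tilted = sum_sq_vertical(xs, ys, a, b + 0.1)
```

Any other line, such as the shifted or tilted ones tried here, gives a larger sum of squares than the regression line.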

13.31 Now our cluster of points, though specified by means of n variables 
and hence in an n dimensional space, may in fact lie, at least approximately, 
in a space of fewer dimensions. For instance the cluster of points of 
Figure 12.1 (lying in three dimensions) might perhaps lie on a plane or 
even on a line. We may, therefore, be able to find new variables, ex- 
pressible as linear functions of the old, which represent the data equally 
well but require fewer independent variables. 

The approach is one aspect of the subject known as factor analysis. It 
seeks to isolate, from a complex of variables, a small number of factors 
which will account for most of the variation. We cannot give here any 
indication of the various techniques which have been developed, mainly 
in psychology, to carry out the analysis, for most of them involve advanced 
mathematics as well as some complicated theoretical problems. The 
reader who wishes to pursue the subject may refer to Factor Analysis by 
Holzinger and Harman or to Kendall's A Course in Multivariate Analysis, 
1957. 

13.32 A somewhat different line of inquiry known as confluence analysis 
has been followed by Scandinavian writers, mainly by Ragnar Frisch. 
This involves heavy calculations and, in effect, depends on working out 
all the possible regressions in order to see how far the appearance of a new 
variate disturbs the previous coefficients. For some account of the 
method see Frisch's Confluence Analysis, 1934 (Oslo) and Reiersøl, 
Econometrica, 1941, 9, 1. 

SUMMARY 

1. Units may be modifiable or non-modifiable. For modifiable units 
the values of correlations depend on the size of the units and must be 
interpreted accordingly. 

2. When units are grouped and correlations calculated from some 
summary features of the group, such as averages, there may be a tendency 
for the correlations to increase with the size of the grouping. Conversely 
as the grouping becomes finer the coefficients may be attenuated. 

3. Correlations for series which are developing in time may be 
misleadingly high if the series accidentally happen to move together. 

4. To elucidate short-term variation in time-series it may be preferable 
to correlate changes from one period to the next rather than the actual 
values of the series. This conception is the origin of the variate-difference 
method which must, however, be used with great caution. 

5. In a general inquiry involving correlation or regression analysis 
efforts are necessary to economise in the number of independent variables. 


EXERCISES 

13.1 Examine how far Tables 9.5 and 9.6 are based on modifiable units. 

13.2 The following table shows, for the United Kingdom, the population 
and the infantile mortality for certain years: 

Year    Population (000)    Deaths of infants per 1,000 births
                            (approx. at census date)
1871        31,485                    144
1881        34,885                    134
1891        37,733                    141
1901        41,459                    140
1911        45,222                    108
1921        47,123                     81
1931        47,289                     67

Show that the values are correlated. How far would you regard this as 
a nonsense-correlation ? 

(Data from the Statistical Abstract for the U.K., Cmd. 5908, 1939. The 
figures for 1931 exclude the territory now forming Eire but this may be 
ignored for the purpose of the example.) 

13.3 The following table shows the number of steam ships registered as 
belonging to the United Kingdom and the receipts from horse-drawn 
vehicle-licenses in Great Britain for certain years: 

Year    Number of steam vessels    Receipts from horse-drawn
                                   vehicle-licenses
1924          10,690                   140,719
1925          10,526                   118,847
1926          10,262                    98,459
1927          10,032                    80,302
1928           9,959                    64,675
1929           9,855                    51,199
1930           9,729                    40,878
1931           9,529                    32,303
1932           9,248                    25,700
1933           8,900                    21,288
1934           8,622                    17,661
1935           8,306                    14,481
1936           8,032                    11,579
1937           7,702                     9,177


Bearing in mind the development of diesel-propelled ships and of the 
motor car, consider how far the correlation between these figures may be 
regarded as nonsense. 



CHAPTER FOURTEEN 


MISCELLANEOUS THEOREMS INVOLVING 
THE CORRELATION COEFFICIENT 


Algebraical convenience of the correlation coefficient 

14.1 It has already been pointed out that a statistical measure, if it 
is to be widely useful, should lend itself readily to algebraical treatment. 
The arithmetic mean and the standard deviation derive their importance 
largely from the fact that they fulfil this requirement better than any other 
averages or measures of dispersion ; and the following illustrations, while 
giving a number of results that are of value in one branch or another 
of statistical work, suffice to show that the correlation coefficient can be 
treated with the same facility. This might indeed be expected, seeing 
that the coefficient is derived, like the mean and standard deviation, by a 
straightforward process of summation. 

The standard deviation of the sum or difference of variables 

14.2 Let $X_1$, $X_2$ be two variables, and $Z$ stand for their sum or difference. 
Let $z$, $x_1$, $x_2$ denote deviations of the several variables from their 
arithmetic means. Then, if 

$$Z = X_1 \pm X_2$$

evidently 

$$z = x_1 \pm x_2$$

Squaring both sides of the equation and summing, 

$$\Sigma(z^2) = \Sigma(x_1^2) + \Sigma(x_2^2) \pm 2\Sigma(x_1 x_2)$$

That is, if $r$ be the correlation between $x_1$ and $x_2$, and $\sigma$, $\sigma_1$, $\sigma_2$ the respective 
standard deviations, 

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \pm 2r\sigma_1\sigma_2 \qquad (14.1)$$

If $X_1$ and $X_2$ are uncorrelated, we have the important special case 

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \qquad (14.2)$$

The student should notice that in this case the standard deviation of 
the sum of corresponding values of the two variables is the same as the 
standard deviation of their difference. If we write var $X$ for the variance 
of $X$ and cov $(X, Y)$ for the covariance of $X$ and $Y$ we may express (14.1) 
as 

$$\operatorname{var}(X \pm Y) = \operatorname{var} X + \operatorname{var} Y \pm 2\operatorname{cov}(X, Y) \qquad (14.3)$$

and (14.2) as 

$$\operatorname{var}(X \pm Y) = \operatorname{var} X + \operatorname{var} Y \qquad (14.4)$$

The same process will evidently give the standard deviation of a linear 
function of any number of variables. For the sum of a series of variables 
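$X_1, X_2, \ldots, X_N$, we must have: 

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_N^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \ldots + 2r_{23}\sigma_2\sigma_3 + \ldots$$

$r_{12}$ being the correlation between $X_1$ and $X_2$, $r_{13}$ the correlation between 
$X_1$ and $X_3$, and so on. 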
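
The identity (14.3) can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the helper names `mean`, `cov` and `var` and the simulated figures are our own assumptions, and population (divide-by-N) moments are used throughout, as in the text.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance: mean product of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def var(a):
    return cov(a, a)

rng = random.Random(0)                 # fixed seed; illustrative data only
x1 = [rng.gauss(10, 2) for _ in range(500)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]   # deliberately correlated with x1

total = [a + b for a, b in zip(x1, x2)]
diff = [a - b for a, b in zip(x1, x2)]

# Left- and right-hand sides of (14.3), for the sum and for the difference.
lhs_sum, rhs_sum = var(total), var(x1) + var(x2) + 2 * cov(x1, x2)
lhs_diff, rhs_diff = var(diff), var(x1) + var(x2) - 2 * cov(x1, x2)
print(lhs_sum, rhs_sum)
print(lhs_diff, rhs_diff)
```

The two members of each printed pair agree up to rounding error, since (14.3) is an algebraic identity and does not depend on the particular data.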

Influence of errors of observation on the standard deviation 

14.3 The results of 14.2 may be applied to the theory of errors of 
observation. Let us suppose that, if any value of X be observed a large 
number of times, the arithmetic mean of the observations is approximately 
the true value, the arithmetic mean error being zero. Then, the arithmetic 
mean error being zero for all values of $X$, the error, say, $\delta$, is uncorrelated 
with $X$. In this case, if $x_1$ be an observed deviation from the arithmetic 
mean, and $x$ the true deviation, we have from the preceding: 

$$\operatorname{var} x_1 = \operatorname{var} x + \operatorname{var} \delta \qquad (14.5)$$

The effect of errors of observation is, consequently, to increase the standard 
deviation above its true value. The student should notice that the 
assumption made does not imply the complete independence of $X$ and $\delta$ : he 
is quite at liberty to suppose that errors fluctuate more, for example, with 
large than with small values of $X$, as might very probably happen. In 
that case the contingency coefficient between $X$ and $\delta$ would not be zero, 
although the correlation coefficient might still vanish as supposed. 

14.4 If certain observations be repeated so that we have in every case 
two measures $x_1$ and $x_2$ of the same deviation $x$, it is possible to obtain 
the true standard deviation $\sigma_x$, if the further assumption is legitimate that 
the errors $\delta_1$ and $\delta_2$ are uncorrelated with each other. On this assumption 

$$\Sigma(x_1 x_2) = \Sigma(x + \delta_1)(x + \delta_2) = \Sigma(x^2)$$

and accordingly 

$$\operatorname{var} x = \sigma_x^2 = \frac{\Sigma(x_1 x_2)}{N} \qquad (14.6)$$

(This formula is part of Spearman's formula for the correction of the 
correlation coefficient ; cf. 14.6.) 

Influence of errors of observation on the correlation coefficient 

14.5 Let $x_1$, $y_1$ be the observed deviations from the arithmetic means, 
$x$, $y$ the true deviations, and $\delta$, $\varepsilon$ the errors of observation. Of the four 
quantities $x$, $y$, $\delta$, $\varepsilon$ we will suppose $x$ and $y$ alone to be correlated. On this 
assumption 

$$\Sigma(x_1 y_1) = \Sigma(xy) \qquad (14.7)$$

It follows at once that 

$$\frac{r_{xy}}{r_{x_1 y_1}} = \frac{\sigma_{x_1}\sigma_{y_1}}{\sigma_x \sigma_y}$$

and consequently, since by (14.5) $\sigma_{x_1}$ exceeds $\sigma_x$ and $\sigma_{y_1}$ exceeds $\sigma_y$, 
the observed correlation is less than the true correlation. 
This difference, it should be noticed, no mere increase in the number of 
observations can in any way lessen. 

Spearman’s theorems 

14.6 If, however, the observations of both $x$ and $y$ be repeated, as 
assumed in 14.4, so that we have two measures $x_1$ and $x_2$, $y_1$ and $y_2$ of every 
value of $x$ and $y$, the true value of the correlation can be obtained by the 
use of equations (14.6) and (14.7), on assumptions similar to those made 
above. For we have: 

$$r_{xy}^2 = \frac{r_{x_1 y_1}\, r_{x_2 y_2}}{r_{x_1 x_2}\, r_{y_1 y_2}} = \frac{r_{x_1 y_2}\, r_{x_2 y_1}}{r_{x_1 x_2}\, r_{y_1 y_2}} \qquad (14.8)$$

Or, if we use all the four possible correlations between observed values of 
$x$ and observed values of $y$, 

$$r_{xy}^4 = \frac{r_{x_1 y_1}\, r_{x_2 y_2}\, r_{x_1 y_2}\, r_{x_2 y_1}}{(r_{x_1 x_2}\, r_{y_1 y_2})^2} \qquad (14.9)$$

Equation (14.9) is the original form in which Spearman gave his correction 
formula. It will be seen to imply the assumption that, of the six 
quantities $x$, $y$, $\delta_1$, $\delta_2$, $\varepsilon_1$, $\varepsilon_2$, only $x$ and $y$ are correlated. The correction 
given by the second part of equation (14.8), also suggested by Spearman, 
seems, on the whole, to be safer, for it eliminates the assumption that the 
errors in $x$ and in $y$, in the same series of observations, are uncorrelated. 
An insufficient though partial test of the correctness of the assumptions 
may be made by correlating $x_1 - x_2$ with $y_1 - y_2$ : this correlation should 
vanish. Evidently, however, it may vanish from symmetry without 
thereby implying that all the correlations of the errors are zero. 
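
The attenuation of 14.5 and the correction of 14.6 may be illustrated by simulation. The sketch below is ours, not the authors': it assumes normally distributed true values and errors, a true correlation of about 0.6 and an error standard deviation of 0.7, all chosen purely for illustration. It applies the second form of (14.8), which uses only correlations between different series of observations in the numerator.

```python
import math
import random

def corr(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(1)                       # fixed seed; simulated data only
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]      # true values of x
y = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in x]   # true y, correlation near 0.6

def observe(values):
    """One series of observations: true value plus an independent error."""
    return [v + rng.gauss(0, 0.7) for v in values]

x1, x2 = observe(x), observe(x)
y1, y2 = observe(y), observe(y)

true_r = corr(x, y)
observed_r = corr(x1, y1)                    # lowered by the errors, as in 14.5
corrected_r = math.sqrt(
    corr(x1, y2) * corr(x2, y1) / (corr(x1, x2) * corr(y1, y2))
)
print(round(true_r, 3), round(observed_r, 3), round(corrected_r, 3))
```

Under these assumptions the observed coefficient falls well short of the true one, while the corrected value lies close to it; with real data the correction is of course only as good as the assumption that the errors are uncorrelated.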
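Mean and standard deviation of an index 

14.7 The means and standard deviations of non-linear functions of 
two or more variables can in general only be expressed in terms of the means 
and standard deviations of the original variables to a first approximation, 
on the assumption that deviations are small compared with the mean values 
of the variables. Thus, let it be required to find the mean and standard 
deviation of a ratio or index $Z = X_1/X_2$, in terms of the constants for $X_1$ and 
$X_2$. Let $I$ be the mean of $Z$, and $M_1$, $M_2$ the means of $X_1$ and $X_2$. Then, 

$$I = \frac{1}{N}\Sigma\left(\frac{X_1}{X_2}\right) = \frac{1}{N}\frac{M_1}{M_2}\Sigma\left(1 + \frac{x_1}{M_1}\right)\left(1 + \frac{x_2}{M_2}\right)^{-1}$$

Expand the second bracket by the binomial theorem, assuming that 
$x_2/M_2$ is so small that powers higher than the second can be neglected. 
Then, to this approximation, 

$$I = \frac{1}{N}\frac{M_1}{M_2}\left[N - \frac{\Sigma(x_1 x_2)}{M_1 M_2} + \frac{\Sigma(x_2^2)}{M_2^2}\right]$$

That is, if $r$ be the correlation between $x_1$ and $x_2$, and if $v_1 = \sigma_1/M_1$, 
$v_2 = \sigma_2/M_2$, 

$$I = \frac{M_1}{M_2}\left(1 - r v_1 v_2 + v_2^2\right) \qquad (14.10)$$

If $s$ be the standard deviation of $Z$, we have: 

$$s^2 + I^2 = \frac{1}{N}\Sigma\left(\frac{X_1}{X_2}\right)^2 = \frac{1}{N}\frac{M_1^2}{M_2^2}\Sigma\left(1 + \frac{x_1}{M_1}\right)^2\left(1 + \frac{x_2}{M_2}\right)^{-2}$$

Expanding the second bracket again by the binomial theorem, and neglecting 
terms of all orders above the second: 

$$s^2 + I^2 = \frac{1}{N}\frac{M_1^2}{M_2^2}\Sigma\left(1 + \frac{2x_1}{M_1} + \frac{x_1^2}{M_1^2}\right)\left(1 - \frac{2x_2}{M_2} + \frac{3x_2^2}{M_2^2}\right) = \frac{M_1^2}{M_2^2}\left(1 + v_1^2 - 4 r v_1 v_2 + 3 v_2^2\right)$$

or from (14.10): 

$$s^2 = \frac{M_1^2}{M_2^2}\left(v_1^2 - 2 r v_1 v_2 + v_2^2\right) \qquad (14.11)$$

which we may also write as 

$$\operatorname{var}(X_1/X_2) = \frac{M_1^2}{M_2^2}\left\{\frac{\operatorname{var} X_1}{M_1^2} - \frac{2\operatorname{cov}(X_1, X_2)}{M_1 M_2} + \frac{\operatorname{var} X_2}{M_2^2}\right\} \qquad (14.12)$$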
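
To see how close the approximations (14.10) and (14.11) come in practice, the sketch below compares them with the mean and standard deviation of the ratios computed directly. It is an illustration only: the function names `ratio_mean_sd` and `moments` are hypothetical, and the data are simulated with coefficients of variation of about 5 per cent, so that the assumption of small deviations is reasonably met.

```python
import math
import random

def ratio_mean_sd(m1, m2, s1, s2, r):
    """Approximate mean and s.d. of X1/X2 from (14.10) and (14.11)."""
    v1, v2 = s1 / m1, s2 / m2                    # coefficients of variation
    mean_ratio = (m1 / m2) * (1 - r * v1 * v2 + v2 ** 2)
    sd_ratio = (m1 / m2) * math.sqrt(v1 ** 2 - 2 * r * v1 * v2 + v2 ** 2)
    return mean_ratio, sd_ratio

def moments(a, b):
    """Means, population standard deviations and correlation of two lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((p - ma) ** 2 for p in a) / n)
    sb = math.sqrt(sum((q - mb) ** 2 for q in b) / n)
    r = sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n * sa * sb)
    return ma, mb, sa, sb, r

rng = random.Random(2)                           # fixed seed; simulated data
x1 = [rng.gauss(100, 5) for _ in range(2000)]
x2 = [50 + 0.2 * (a - 100) + rng.gauss(0, 2) for a in x1]   # correlated with x1

approx_mean, approx_sd = ratio_mean_sd(*moments(x1, x2))

# Direct calculation from the individual ratios, for comparison.
z = [a / b for a, b in zip(x1, x2)]
exact_mean = sum(z) / len(z)
exact_sd = math.sqrt(sum((q - exact_mean) ** 2 for q in z) / len(z))
print(approx_mean, exact_mean)
print(approx_sd, exact_sd)
```

With larger coefficients of variation the neglected higher-order terms become appreciable and the agreement would be expected to deteriorate.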
Correlation between indices 

14.8 The following problem affords a further illustration of the use of 
the same method. Required to find approximately the correlation between 
two ratios $Z_1 = X_1/X_3$, $Z_2 = X_2/X_3$, $X_1$, $X_2$ and $X_3$ being uncorrelated. 

Let the means of the two ratios or indices be $I_1$, $I_2$, and the standard 
deviations $s_1$, $s_2$ ; these are given approximately by (14.10) and (14.11) of 
the last section. The required correlation $\rho$ will be given by: 

$$N \rho s_1 s_2 = \Sigma\left(\frac{X_1}{X_3} - I_1\right)\left(\frac{X_2}{X_3} - I_2\right) = \Sigma\left(\frac{X_1 X_2}{X_3^2}\right) - N I_1 I_2$$

Neglecting terms of higher order than the second as before and 
remembering that all correlations are zero, we have: 

$$\rho s_1 s_2 = \frac{M_1 M_2}{M_3^2}\left(1 + 3v_3^2\right) - I_1 I_2 = \frac{M_1 M_2}{M_3^2}\left[\left(1 + 3v_3^2\right) - \left(1 + v_3^2\right)^2\right] = \frac{M_1 M_2}{M_3^2}\, v_3^2$$

where, in the last step, a term of the order $v_3^4$ has again been neglected. 
Substituting from (14.11) for $s_1$ and $s_2$, we have finally: 

$$\rho = \frac{v_3^2}{\sqrt{(v_1^2 + v_3^2)(v_2^2 + v_3^2)}} \qquad (14.13)$$


This value of $\rho$ is obviously positive, being equal to 0.5 if $v_1 = v_2 = v_3$ ; 
and hence even if $X_1$ and $X_2$ are independent, the indices formed by taking 
their ratios to a common denominator $X_3$ will be correlated. The value of 
$\rho$ was termed by Karl Pearson the " spurious correlation." Thus, if 
measurements be taken, say, on three bones of the human skeleton, and the 
measurements grouped in threes absolutely at random, there will, nevertheless, 
be a positive correlation, probably approaching 0.5, between the 
indices formed by the ratios of two of the measurements to the third. To 
give another illustration, if two individuals both observe the same series 
of magnitudes quite independently, there may be little, if any, correlation 
between their absolute errors. But if the errors be expressed as percentages 
of the magnitude observed, there may be considerable correlation. 
It does not follow of necessity that the correlations between indices or 
ratios are misleading. If the indices are uncorrelated, there will be 
a similar " spurious " correlation between the absolute measurements 
$Z_1 X_3 = X_1$ and $Z_2 X_3 = X_2$, and the answer to the question whether the 
correlation between indices or that between absolute measures is 
misleading depends on the further question whether the indices or the absolute 
measures are the quantities directly determined by the causes under 
investigation. 

The case considered, where $X_1$, $X_2$, $X_3$ are uncorrelated, is only a 
special one ; for the general discussion see K. Pearson, Proc. Roy. Soc., 
1897, 60, 489. For an interesting study of actual illustrations see J. W. 
Brown and others, J. Roy. Stat. Soc., 1914, 77, 317. 
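
The spurious correlation of (14.13) is easily exhibited by experiment. In the sketch below, which is our own illustration with hypothetical names and simulated figures, three independent variables each with a coefficient of variation of 0.05 are generated, and the correlation between the two indices with common denominator is compared with the value given by the formula.

```python
import math
import random

def spurious_correlation(v1, v2, v3):
    """Approximate correlation (14.13) between X1/X3 and X2/X3."""
    return v3 ** 2 / math.sqrt((v1 ** 2 + v3 ** 2) * (v2 ** 2 + v3 ** 2))

def corr(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(3)                           # fixed seed; simulated data
n = 5000
x1 = [rng.gauss(100, 5) for _ in range(n)]       # three independent variables,
x2 = [rng.gauss(200, 10) for _ in range(n)]      # each with coefficient of
x3 = [rng.gauss(50, 2.5) for _ in range(n)]      # variation 0.05

z1 = [a / c for a, c in zip(x1, x3)]             # indices with the common
z2 = [b / c for b, c in zip(x2, x3)]             # denominator x3

predicted = spurious_correlation(0.05, 0.05, 0.05)
simulated = corr(z1, z2)
print(predicted, round(simulated, 3))
```

The formula gives 0.5 for equal coefficients of variation, and the simulated coefficient should fall near that value although the three variables were generated independently.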

Correlation due to heterogeneity of material 

14.9 The following theorem offers some analogy with the theorem of 
2.26 for attributes : If X and Y are uncorrelated in each of two records, they 
will nevertheless exhibit some correlation when the two records are mingled, 
unless the mean value of X in the second record is identical with that in the first 
record, or the mean value of Y in the second record is identical with that in the 
first record, or both. 

This follows almost at once, for if $M_1$, $M_2$ are the mean values of $X$ in 
the two records, $K_1$, $K_2$ the mean values of $Y$, $N_1$, $N_2$ the numbers of 
observations, and $M$, $K$ the means when the two records are mingled, the 
product-sum of deviations about $M$, $K$ is: 

$$N_1(M_1 - M)(K_1 - K) + N_2(M_2 - M)(K_2 - K)$$

Evidently the first term can only be zero if $M = M_1$ or $K = K_1$. But 
the first condition gives: 

$$\frac{N_1 M_1 + N_2 M_2}{N_1 + N_2} = M_1$$

that is, 

$$M_2 = M_1$$

Similarly, the second condition gives $K_1 = K_2$. Both the first and second 
terms can, therefore, only vanish if $M_1 = M_2$ or $K_1 = K_2$. Correlation may 
accordingly be created by the mingling of two records in which $X$ and $Y$ 
vary round different means. 
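
A small artificial example makes the theorem concrete. The records below are invented for the purpose (they are not data from the text): within each record X and Y are exactly uncorrelated, yet mingling two records whose means differ in both variables produces a high correlation, while mingling records that differ only in the mean of X produces none.

```python
import math

def corr(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def r_of(points):
    xs, ys = zip(*points)
    return corr(xs, ys)

record_1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
record_2 = [(5, 5), (5, 6), (6, 5), (6, 6)]   # means of X and of Y both shifted
record_3 = [(5, 0), (5, 1), (6, 0), (6, 1)]   # only the mean of X shifted

r1, r2 = r_of(record_1), r_of(record_2)       # zero within each record
r_mixed = r_of(record_1 + record_2)           # correlation created by mingling
r_mixed_same_y = r_of(record_1 + record_3)    # none: the Y means coincide

# Product-sum N1(M1 - M)(K1 - K) + N2(M2 - M)(K2 - K) for records 1 and 2,
# with M1 = K1 = 0.5, M2 = K2 = 5.5 and pooled means M = K = 3.
product_sum = 4 * (0.5 - 3) * (0.5 - 3) + 4 * (5.5 - 3) * (5.5 - 3)
print(r1, r2, r_mixed, r_mixed_same_y, product_sum)
```

Here the product-sum is 50 and the sum of squared deviations of each variable in the mingled record is 52, so that the created correlation is 50/52, or about 0.96.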


Reduction of correlation due to mingling of uncorrelated with correlated 
pairs 

14.10 Suppose that $n_1$ observations of $x$ and $y$ give a correlation 
coefficient: 

$$r_1 = \frac{\Sigma(xy)}{n_1 \sigma_x \sigma_y}$$

Now, let $n_2$ pairs be added to the material, the means and standard deviations 
of $x$ and $y$ being the same as in the first series of observations, but the 
correlation zero. The value of $\Sigma(xy)$ will then be unaltered, and we shall 
have: 

$$r_2 = \frac{\Sigma(xy)}{(n_1 + n_2)\, \sigma_x \sigma_y}$$

Whence 

$$\frac{r_2}{r_1} = \frac{n_1}{n_1 + n_2} \qquad (14.14)$$

Suppose, for example, that a number of bones of the human skeleton have 
been disinterred during some excavations, and a correlation $r_2$ is observed 
between pairs of bones presumed to come from the same skeleton, this 
correlation being rather lower than might have been expected, and subject 
to some uncertainty owing to doubts as to the allocation of certain bones. 
If $r_1$ is the value that would be expected from other records, the difference 
might be accounted for on the hypothesis that, in a proportion $(r_1 - r_2)/r_1$ 
of all the pairs, the bones do not really belong to the same skeleton, and 
have been virtually paired at random. 
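
Equation (14.14) and the proportion of random pairs derived from it can be put into a couple of lines. The following sketch is illustrative only; the function names and the figures (a correlation of 0.8 in 300 genuine pairs, mingled with 100 random pairs) are hypothetical.

```python
def diluted_correlation(r1, n1, n2):
    """Correlation (14.14) after n2 uncorrelated pairs join n1 correlated ones."""
    return r1 * n1 / (n1 + n2)

def random_pair_fraction(r1, r2):
    """Proportion of pairs that must be random to lower r1 to r2."""
    return (r1 - r2) / r1

r2 = diluted_correlation(0.8, 300, 100)      # 300 genuine pairs, 100 random
fraction = random_pair_fraction(0.8, r2)     # recovers 100 / 400
print(r2, fraction)
```

The second function simply inverts the first, so the fraction recovered equals the proportion of random pairs actually introduced, here one quarter.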

The weighted mean 

14.11 The arithmetic mean $M$ of a series of values of a variable $X$ was 
defined as the quotient of the sum of those values by their number $N$, or 

$$M = \Sigma(X)/N$$

If, on the other hand, we multiply each individual observed value of $X$ 
by some numerical coefficient or weight $W$, the quotient of the sum of such 
products by the sum of the weights is defined as a weighted mean of $X$, and 
may be denoted by $M'$ ; so that 

$$M' = \Sigma(WX)/\Sigma(W)$$

The distinction between " weighted " and " unweighted " means is, 
it should be noted, very often formal rather than essential, for the 
" weights " may be regarded as actual, estimated or virtual frequencies. 
The weighted mean then becomes simply an arithmetic mean, in which 
some new quantity is regarded as the unit. Thus, if we are given the means 
$M_1, M_2, M_3, \ldots, M_r$ of $r$ series of observations, but do not know the 
number of observations in every series, we may form a general average by 
taking the arithmetic mean of all the means, viz. $\Sigma(M)/r$, treating the series 
as the unit. But if we know the number of observations in every series it 
will be better to form the weighted mean $\Sigma(NM)/\Sigma(N)$, weighting each mean 
in proportion to the number of observations in the series on which it is 
based. The second form of average would be quite correctly spoken of as 
a weighted mean of the means of the several series : at the same time, it 
is simply the arithmetic mean of all the series pooled together, i.e. the 
arithmetic mean obtained by treating the observation and not the series 
as the unit. 

14.12 To give an arithmetical illustration, if a commodity is sold at 
different prices in different markets, it will be better to form an average 
price, not by taking the arithmetic mean of the several market prices, 
treating the market as the unit, but by weighting each price in proportion 
to the quantity sold at that price, if known, i.e. treating the unit of quantity 
as the unit of frequency. Thus, if wheat has been sold in market A at an 
average price of 29s. 1d. per quarter, in market B at an average price of 
27s. 7d. and in market C at an average price of 28s. 4d., we may, if no 
statement is made as to the quantities sold at these prices (as very often 
happens in the case of statements as to market prices), take the arithmetic 
mean (28s. 4d.) as the general average. But if we know that 23,930 qrs. 
were sold at A, only 26 qrs. at B and 3,933 qrs. at C, it will be better to 
take the weighted mean 

$$\frac{(29\text{s. }1\text{d.} \times 23{,}930) + (27\text{s. }7\text{d.} \times 26) + (28\text{s. }4\text{d.} \times 3{,}933)}{27{,}889} = 29\text{s.}$$

to the nearest penny. This is appreciably higher than the arithmetic mean 
price, which is lowered by the undue importance attached to the small 
markets B and C. 
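
The arithmetic of this illustration is conveniently carried out in pence (12 pence to the shilling). The sketch below reproduces both averages from the figures quoted above; the helper names `to_pence` and `as_shillings` are our own.

```python
def to_pence(shillings, pence):
    return 12 * shillings + pence

# Market: (average price in pence per quarter, quarters sold), as in the text.
markets = {
    "A": (to_pence(29, 1), 23930),
    "B": (to_pence(27, 7), 26),
    "C": (to_pence(28, 4), 3933),
}

prices = [price for price, _ in markets.values()]
unweighted = sum(prices) / len(prices)

total_value = sum(price * qty for price, qty in markets.values())
total_qty = sum(qty for _, qty in markets.values())
weighted = total_value / total_qty

def as_shillings(pence):
    """Round to the nearest penny and express in shillings and pence."""
    s, d = divmod(round(pence), 12)
    return f"{s}s. {d}d."

print(as_shillings(unweighted))   # 28s. 4d.
print(as_shillings(weighted))     # 29s. 0d.
```

The weighted mean comes to about 347.7 pence, which is 29s. to the nearest penny, against 340 pence for the unweighted mean.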

14.13 In the case of index-numbers for exhibiting the changes in average 
prices from year to year, it may make a sensible difference whether we 
take the simple arithmetic mean of the index-numbers for different 
commodities in any one year as representing the price-level in that year, 
or weight the index-numbers for the several commodities according to 
their importance from some point of view. If, for example, our standpoint 
be that of some average consumer, we may take as the weight for each 
commodity the sum which he spends on that commodity in an average 
year, so that the frequency of each commodity is taken as the number of 
shillings or pounds spent thereon instead of simply as unity. We revert 
to this topic in Chapter 25. 

14.14 Rates or ratios like the birth-, death- or marriage-rates of a country 
may be regarded as weighted means. For, treating the rate for simplicity 
as a fraction, and not as a rate per 1,000 of the population, 

$$\text{Birth-rate of whole country} = \frac{\text{Total births}}{\text{Total population}} = \frac{\Sigma(\text{Birth-rate in each district} \times \text{population in that district})}{\Sigma(\text{Population of each district})}$$

i.e. the rate for the whole country is the mean of the rates in the different 
districts, weighting each in proportion to its population. We use the 
weighted and unweighted means of such rates as illustrations in 14.16 
below. 
14.15 It is evident that any weighted mean will in general differ from 
the unweighted mean of the same quantities, and it is required to find an 
expression for this difference. If $r$ be the correlation between weights and 
variables, $\sigma_w$ and $\sigma_x$ the standard deviations and $\bar{w}$ the mean weight, we 
have at once 

$$\Sigma(WX) = N(M\bar{w} + r\sigma_w\sigma_x)$$

whence 

$$M' = M + r\sigma_x\frac{\sigma_w}{\bar{w}} \qquad (14.15)$$

That is to say, if the weights and variables are positively correlated, the 
weighted mean is the greater ; if negatively, the less. In some cases $r$ is 
very small, and then weighting makes little difference, but in others the 
difference is large and important, $r$ having a sensible value and $\sigma_x\sigma_w/\bar{w}$ a 
large value. 
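
Equation (14.15) may be verified on any small set of figures. The values and weights in the sketch below are invented for illustration; the weighted mean is computed once directly from its definition and once from the unweighted mean, the correlation and the two standard deviations.

```python
import math

X = [10, 12, 15, 20]          # hypothetical values of the variable
W = [1, 2, 3, 4]              # hypothetical weights, rising with X
n = len(X)

M = sum(X) / n                # unweighted mean of X
w_bar = sum(W) / n            # mean weight
sigma_x = math.sqrt(sum((x - M) ** 2 for x in X) / n)
sigma_w = math.sqrt(sum((w - w_bar) ** 2 for w in W) / n)
r = sum((w - w_bar) * (x - M) for w, x in zip(W, X)) / (n * sigma_w * sigma_x)

weighted_direct = sum(w * x for w, x in zip(W, X)) / sum(W)
weighted_formula = M + r * sigma_x * sigma_w / w_bar      # equation (14.15)
print(M, weighted_direct, weighted_formula)
```

Since the weights here increase with the values, $r$ is positive and the weighted mean (15.9) exceeds the unweighted mean (14.25), as the formula requires.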

14.16 The difference between weighted and unweighted means of death- 
rates, birth-rates or other rates on the population in different districts 
is nearly always of importance. For instance, in 1941, the 
birth-rates per 1,000 civilian population in Lancashire were: 

County Boroughs    16.1
Urban Districts    14.7
Rural Districts    14.4

The mean value of these three is 15.07 whereas the birth-rate for Lancashire 
as a whole was 15.5, a reflection of the well-known fact that the 
more populous areas have the higher birth-rate. The death-rates, 
excluding civilian war-deaths, were: 

County Boroughs    15.6
Urban Districts    13.2
Rural Districts    11.0

with a mean of 13.27, against a (weighted) mean for the whole county 
of 14.5. There appears to be a positive correlation between death-rate 
and size of population as well as between birth-rate and population, 
though no doubt for different reasons. Urban aggregations have a larger 
proportion of the young than rural areas, and hence a higher birth-rate, 
but on the other hand living conditions are more unfavourable to life 
and this factor outbalances the effect of the more favourable 
age-composition on the death-rate. 

Age-composition may exert a similar effect on marriage rates. For 
instance, persons married per 1,000 in the regions of England and Wales 
in 1941 were as follows: 

South East     21.6
North I        19.5
North II       19.0
North III      19.9
North IV       19.9
Midland I      20.0
Midland II     19.2
East           19.0
South-west     17.2
Wales I        20.1
Wales II       16.3

The mean of these figures is 19.25 whereas the marriage rate for the 
whole country was 20.1. The explanation is that the more populous 
areas contain a greater proportion of younger people and hence have a 
higher marriage-rate. 
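
The unweighted means quoted in this section can be reproduced from the tabulated rates, as in the short sketch below (the variable names are ours). The weighted means for the whole county and country cannot be recomputed in the same way, since the populations of the several areas are not given in the text.

```python
def unweighted_mean(rates):
    return sum(rates) / len(rates)

lancashire_birth = [16.1, 14.7, 14.4]    # county boroughs, urban, rural
lancashire_death = [15.6, 13.2, 11.0]
marriage = [21.6, 19.5, 19.0, 19.9, 19.9, 20.0, 19.2, 19.0, 17.2, 20.1, 16.3]

print(round(unweighted_mean(lancashire_birth), 2))   # 15.07
print(round(unweighted_mean(lancashire_death), 2))   # 13.27
print(round(unweighted_mean(marriage), 2))           # 19.25
```

In each case the rate for the whole area quoted in the text (15.5, 14.5 and 20.1 respectively) exceeds the unweighted mean, which by (14.15) corresponds to a positive correlation between the rate and the population of the area.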


14.17 The principle of weighting finds one very important application 
in the treatment of such rates as death-rates, which are largely affected 
by the age and sex composition of the population. Neglecting, for 
simplicity, the question of sex, suppose the numbers of deaths are noted 
in a certain district for, say, the age-groups 0-, 10-, 20-, etc., in which 
the fractions of the whole population are $p_1$, $p_2$, etc., where $\Sigma(p) = 1$. 
Let the death-rates for the corresponding age-groups be $d_1$, $d_2$, etc. Then 
the ordinary or crude death-rate for the district is 

$$D = \Sigma(dp) \qquad (14.16)$$

For some other district taken as a basis of comparison, perhaps the 
country as a whole, the death-rates and fractions of the population in the 
several age-groups may be $\delta_1, \delta_2, \ldots$, $\pi_1, \pi_2, \pi_3, \ldots$, and the crude 
death-rate 

$$\Delta = \Sigma(\delta\pi) \qquad (14.17)$$

Now, $D$ and $\Delta$ differ either because the $d$'s and $\delta$'s differ or because 
the $p$'s and $\pi$'s differ, or both. It may happen that really both districts 
are about equally healthy, and the death-rates approximately the same 
for all age-classes, but, owing to a difference of weighting, the first average 
may be markedly higher than the second, or vice versa. If the first 
district be a rural district and the second urban, for instance, there will be 
a larger proportion of the old in the former, and it may possibly have a 
higher crude death-rate than the second, in spite of lower death-rates in 
every class. The comparison of crude death-rates is therefore liable to 
lead to erroneous conclusions. The difficulty may be got over by averaging 
the age-class death-rates in the district not with the weights $p_1, p_2, p_3, \ldots$ 
given by its own population, but with the weights $\pi_1, \pi_2, \pi_3, \ldots$ given 
by the population of the standard district. The standardised death-rate 
for the district will then be 

$$D' = \Sigma(d\pi) \qquad (14.18)$$

and $D'$ and $\Delta$ will be comparable as regards age-distribution. There is 
obviously no difficulty in taking sex into account as well as age if necessary. 
The death-rates must be noted for each sex separately in every age-class 
and averaged with a system of weights based on the standard population. 
The method is also of importance for comparing death-rates in different 
classes of the population, e.g. those engaged in given occupations, as 
well as in different districts, and is used for both these purposes in the 
publications of the Registrar-General for England and Wales. 
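
The effect described in 14.17 can be seen in a small numerical sketch. All figures below are hypothetical, chosen only so that the district has lower death-rates than the standard population in every age-class but a larger proportion of the old; rates are expressed as fractions, and three broad age-classes are assumed for brevity.

```python
def weighted_rate(rates, fractions):
    """Sum of age-class rates weighted by population fractions."""
    return sum(rate * frac for rate, frac in zip(rates, fractions))

# Hypothetical district: age-class death-rates d and population fractions p.
d = [0.004, 0.008, 0.050]
p = [0.20, 0.50, 0.30]

# Hypothetical standard population: rates delta and fractions pi.
delta = [0.005, 0.010, 0.055]
pi = [0.30, 0.55, 0.15]

D = weighted_rate(d, p)              # crude rate of the district, (14.16)
Delta = weighted_rate(delta, pi)     # crude rate of the standard, (14.17)
D_std = weighted_rate(d, pi)         # standardised rate of the district, (14.18)
print(D, Delta, D_std)
```

On these figures the crude rate of the district (0.0198) exceeds that of the standard population (0.01525), although every age-class rate is lower; the standardised rate (0.0131) restores the true order.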

14.18 Difficulty may arise in practical cases from the fact that the 
death-rates $d_1, d_2, d_3, \ldots$ are not known for the districts or classes which 
it is desired to compare with the standard population, but only the crude 
rates $D$ and the fractional populations of the age-classes $p_1, p_2, p_3, \ldots$ 
The difficulty may be partially obviated (cf. 2.30 and Example 2.10, 
pp. 38-40) by forming what is termed an index death-rate $\Delta'$ for the class 
or district, $\Delta'$ being given by 

$$\Delta' = \Sigma(\delta p) \qquad (14.19)$$

i.e. the rates of the standard population averaged with the weights of 
the district population. It is the crude death-rate that there would be in 
the district if the rate in every age-class were the same as in the standard 
population. An approximate standardised death-rate for the district or 
class is then given by 

$$D'' = D \times \frac{\Delta}{\Delta'} \qquad (14.20)$$

$D''$ is not necessarily, nor generally, the same as $D'$. It can only be the 
same if 

$$\frac{\Sigma(d\pi)}{\Sigma(dp)} = \frac{\Sigma(\delta\pi)}{\Sigma(\delta p)}$$

This will hold good if, e.g., the death-rates in the standard population 
and the district stand to one another in the same ratio in all age-classes, 
i.e. $\delta_1/d_1 = \delta_2/d_2 = \delta_3/d_3 = $ etc. This method of standardisation was used 
in the Annual Summaries of the Registrar-General for England and Wales. 
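
The approximate method of 14.18 may be compared with the direct method on invented figures. In the sketch below (hypothetical rates and populations, three age-classes assumed), the index death-rate (14.19) and the approximate standardised rate (14.20) are computed for a district, and the case of proportional rates is added to show that the two methods then coincide.

```python
def weighted_rate(rates, fractions):
    """Sum of age-class rates weighted by population fractions."""
    return sum(rate * frac for rate, frac in zip(rates, fractions))

def indirect_standardised_rate(crude, delta, pi, p):
    """Equation (14.20): crude rate D multiplied by Delta / Delta'."""
    Delta = weighted_rate(delta, pi)        # crude rate of the standard
    Delta_index = weighted_rate(delta, p)   # index death-rate, (14.19)
    return crude * Delta / Delta_index

# Hypothetical district (d, p) and standard population (delta, pi).
d = [0.004, 0.008, 0.050]
p = [0.20, 0.50, 0.30]
delta = [0.005, 0.010, 0.055]
pi = [0.30, 0.55, 0.15]

D = weighted_rate(d, p)
D_direct = weighted_rate(d, pi)                            # D' of (14.18)
D_indirect = indirect_standardised_rate(D, delta, pi, p)   # D'' of (14.20)

# When district rates are a constant multiple of the standard rates,
# the two methods agree.
d_prop = [0.8 * rate for rate in delta]
prop_direct = weighted_rate(d_prop, pi)
prop_indirect = indirect_standardised_rate(weighted_rate(d_prop, p), delta, pi, p)
print(D_direct, D_indirect, prop_direct, prop_indirect)
```

Note that the indirect calculation uses the age-class rates of the district only through the crude rate $D$, which is the whole point of the method.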

14.19 Both methods of standardisation — that of 14.17 and that of 
14.18 — are of great importance. They are obviously applicable to other 
rates besides death-rates, e.g. birth-rates. Further, they may readily be 
extended into quite different fields. Thus it has been suggested that 
standardised average heights or standardised average weights of the children 
in different schools might be obtained on the basis of a standard school 
population of given age and sex composition, or indeed of given 
composition as regards hair- and eye-colour as well. 


14.20 In 14.11-14.16 we have dealt only with the theory of the weighted 
arithmetic mean, but it should be noted that any form of average can be 
weighted. Thus a weighted median can be formed by finding the value 
of the variable such that the sum of the weights of lesser values is equal 
to the sum of the weights of greater values. A weighted mode could 
be formed by finding the value of the variable for which the sum of the 
weights was greatest, allowing for the smoothing of casual fluctuations. 
Similarly, a weighted geometric mean could be calculated by weighting 
the logarithms of every value of the variable before taking the arithmetic 
mean, i.e. 


$$\log G_w = \frac{\Sigma(W \log X)}{\Sigma(W)}$$
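
Two of these weighted averages are sketched below. The functions are our own illustration: for the weighted median we adopt, as an assumed convention, the smallest value at which the accumulated weight reaches half the total (other conventions interpolate between values), and the weighted geometric mean follows the formula just given, with natural logarithms.

```python
import math

def weighted_median(values, weights):
    """Smallest value at which the running weight reaches half the total."""
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value

def weighted_geometric_mean(values, weights):
    """Antilogarithm of the weighted arithmetic mean of the logarithms."""
    log_mean = sum(w * math.log(x) for x, w in zip(values, weights)) / sum(weights)
    return math.exp(log_mean)

print(weighted_median([1, 2, 3, 4], [1, 1, 1, 5]))        # 4
print(weighted_geometric_mean([1, 10, 100], [1, 1, 2]))   # 10 ** 1.25, about 17.78
```

In the second example the weighted mean of the common logarithms 0, 1 and 2 is 5/4, whence the result.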


SUMMARY 

1. The standard deviation of the sum of variables $X_1, X_2, \ldots, X_N$ 
is given by 

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_N^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \ldots + 2r_{23}\sigma_2\sigma_3 + \ldots$$

which may also be written 

$$\operatorname{var}\{\Sigma(X)\} = \Sigma(\operatorname{var} X) + \Sigma\{\operatorname{cov}(X_i, X_j)\}, \quad i \neq j$$

2. In particular, the variance of the sum of N uncorrelated variates is 
the sum of their variances. 

3. If $X_1$, $X_2$ and $X_3$ are uncorrelated, the indices $X_1/X_3$, $X_2/X_3$ will 
nevertheless be correlated in general. 

4. If X and Y are uncorrelated in each of two separate records, they 
will be correlated in the sum of the two records, unless either the means 
of X or the means of Y, or both, are the same in the two records. 

5. If correlated and uncorrelated material is mingled, the correlation 
in the total is lower than that in the correlated portion. 

6. An arithmetic mean is weighted when, in the calculation of $\Sigma(X)$, 
each value of the variate is multiplied by a weight $W$. 

7. The weighted arithmetic mean is greater or less than the unweighted 
mean according as the weights and variables are positively or negatively 
correlated. 
EXERCISES 

14.1 (Data from the Decennial Supplements to the Annual Reports of the 
Registrar-General for England and Wales.) The following particulars 
are found for 96 small registration districts in which the number of births 
in a decade ranged between 1,500 and 2,500: 

                 Proportion of male births per 1,000 of all births
Decade           Mean        Standard deviation
1881-1890        508.1       12.80
1891-1900        508.4       10.37
Both decades     508.25      11.65

It is believed, however, that a great part of the observed standard 
deviation is due to mere " fluctuations of sampling " of no real significance. 

Given that the correlation between the proportions of male births in a 
district in the two decades is +0.36, estimate (1) the true standard deviation 
freed from such fluctuations of sampling ; (2) the standard deviation 
of fluctuations of sampling, i.e. of the errors produced by such fluctuations 
in the observed proportions of male births. 

14.2 The coefficients of variation for breadth, height and length of 
certain skulls are 3.89, 3.50 and 3.24 per cent respectively. Find the 
" spurious correlation " between the breadth/length and height/length 
indices, absolute measures being combined at random so that they are 
uncorrelated. 

14.3 (Data from Boas, communicated to Pearson ; cf. Fawcett and 
Pearson, Proc. Roy. Soc., 62, p. 413.) From short series of measurements 
on American Indians, the mean coefficient of correlation found between 
father and son, and father and daughter, for cephalic index, is 0.14 ; 
between mother and son, and mother and daughter, 0.33. Assuming 
these coefficients should be the same if it were not for the looseness of 
family relations, find the proportion of children not due to the reputed 
father. 

14.4 Find the correlation between $X_1 + X_2$ and $X_2 + X_3$, $X_1$, $X_2$ and $X_3$ 
being uncorrelated. 

14.5 Find the correlation between $X_1$ and $aX_1 + bX_2$, $X_1$ and $X_2$ being 
uncorrelated. 

14.6 (Referring to 13.17.) Use the answer to Exercise 14.5 to estimate, 
very roughly, the correlation that would be found between annual 
movements in infantile and general mortality if the mortality of those 
under and over 1 year of age were uncorrelated. Note that: 

$$\text{General mortality per 1,000 of population} = \text{Infantile mortality per 1,000 births} \times \frac{\text{Births}}{\text{Population}} + \text{Deaths over one year per 1,000 of population}$$

and treat the ratio of births to population as if it were constant at a rough 
average value, say 0.032. The standard deviation of annual movements 
in infantile mortality is (loc. cit.) 10.76, and that of annual movements in 
mortality other than infantile may be taken as sensibly the same as that 
of general mortality, or, say, 1.13 units. 


14.7 If the relation 

$$ax_1 + bx_2 + cx_3 = 0$$

holds for all values of $x_1$, $x_2$ and $x_3$ (which are, in our usual notation, 
deviations from the respective arithmetic means), find the correlations 
between $x_1$, $x_2$ and $x_3$ in terms of their standard deviations and the values 
of $a$, $b$ and $c$. 


14.8 What is the effect on a weighted mean of errors in the weights of the 
quantities weighted, such errors being uncorrelated with one another, with 
the weights or with the variables : (1) if the arithmetic mean values of 
the errors are zero, (2) if the arithmetic mean values of the errors are not 
zero ? 


14.9 The following are the variances of the rainfall (1) for January to 
March, (2) for April to December, (3) for the whole year, at Greenwich in 
the eighty years 1841-1920, the unit being a millimetre — 

January-March      $\sigma_1^2$ = 1,521 

April-December     $\sigma_2^2$ = 8,968 

Whole year         $\sigma^2$ = 10,754 

Find the correlation between the rainfall in January-March and April- 
December. 


14.10 If of three variables A, B, C, the variance of the sum of A and B 
is the sum of the variances of A and B and the variance of the sum of 
B and C is the sum of the variances of B and C ; show that the variance 
of the sum of A and C is not necessarily the sum of the variances of A 
and C. What must be the correlation between A+B and B+C for 
this to be true ? 



CHAPTER FIFTEEN 


SIMPLE CURVE FITTING 


The problem 

15.1 In this chapter we turn aside somewhat from the line of development 
of previous chapters in order to study a subject of considerable theoretical 
and practical importance, namely the representation of relationship between 
two variables by simple algebraic expressions. Our work on correlation 
has already led us to fit regression lines and planes to the means of arrays. 
We now attack a rather more general problem. An illustration will make 
clear the type of inquiry involved. 

TABLE 15.1. Estimated distance and velocities of recession of 10 extra-galactic nebulae

(Edwin Hubble and Milton L. Humason, "The Velocity-distance Relation among Extra-galactic Nebulae,"
Contributions from Mount Wilson Observatory, Carnegie Institution of Washington, No. 427; Astrophysical Journal,
1931, 74, 43.)

Constellation in which     Mean velocity               Distance
the nebula is situated     (kilometres per second)     (millions of parsecs)

Isolated Nebula II                   630                       1.20
Virgo                                890                       1.82
Isolated Nebula I                  2,350                       3.31
Pegasus                            3,810                       7.24
Pisces                             4,630                       6.92
Cancer                             4,820                       9.12
Perseus                            5,230                      10.97
Coma                               7,500                      14.45
Ursa Major                        11,800                      22.91
Leo                               19,600                      36.31


Table 15.1 shows the estimated distance and velocities of recession of
certain nebulae in the outlying parts of the visible universe.

A little inspection of the table will show that there appears to be some
relation between distance and velocity: the greater the one, the greater
the other, with only one exception. A diagram makes the relation clearer
still. In fig. 15.1 we have taken the two variables velocity and distance
as rectangular co-ordinates y and x, and have marked for each nebula
a point whose co-ordinates are the distance and velocity of that nebula.
The ten points so obtained evidently lie very approximately on a straight
line or, to express the same fact algebraically, the ten values of the variables
are closely represented by an equation of the form

\[ y = a_0 + a_1 x \qquad (15.1) \]

where we use small letters to denote current co-ordinates.

15.2 No straight line, however, passes exactly through all the points,
although a great many lines may be drawn which nearly do so. The
question then arises, is there a straight line which fits the points better
than all others, and if so, which is it? Or, in other language, what values
of $a_0$ and $a_1$ in equation (15.1) must we take to get the best representation
of the linear relationship between the two variables? And, as a further
question, can we devise a measure of the closeness of the fit of the various
lines which can be drawn?



Fig. 15.1. Relationship between distance and velocity of recession in certain
extra-galactic nebulae. (Table 15.1) (Horizontal axis: distance, millions of parsecs.)

15.3 In the foregoing illustration it is clear from the data or from the
diagram that a linear relationship between the variables gives a very
close picture of the truth. In other cases the points of the diagram will
lie more or less on a curve, and no straight line will give a satisfactory
representation. We should then wish to investigate whether the dependence
of y on x may be suitably represented by the more general equation

\[ y = a_0 + a_1 x + a_2 x^2 + \ldots + a_p x^p \qquad (15.2) \]

which, in the diagram, corresponds to a curve of the type known as
parabolic. The number p indicates the degree of the parabola, and we
speak of quadratic, cubic, quartic parabolas, meaning curves of type
(15.2) with p = 2, 3, 4, respectively.






15.4 Our general problem may, then, be stated as follows: Given n
pairs of values of two variables, $X_1 Y_1$, $X_2 Y_2$, ..., $X_n Y_n$, to express the
values of one of them as nearly as may be in terms of the other by an
equation of the form (15.2); and to measure the closeness of the
approximation of the values of y given by the equation to the actual values. In
geometrical language, given n points in a plane, to fit to them a curve of
the parabolic type (15.2) and to measure the closeness of fit.

15.5 The representation of data in this way may serve several purposes.
In the first place, it may present the relationship between the two variables
in a useful summary form. Secondly, it may be used to interpolate, i.e.
to estimate the values of one variable which would correspond to specified
values of the other. In fig. 15.1, for example, the straight line which
has been drawn in, and whose equation is obtained below, tells us what
we might expect to be the velocity of a nebula whose distance is, say,
20 million parsecs, on the assumption that the linear relation holds good
for nebulae in general.

15.6 Again, the representation may also be very suggestive to the
theorist. The linear form of the relationship between the variables of
Table 15.1 involves more than a convenient summary of the facts, and has
inspired a great deal of research into the nature of the physical universe.
In such cases, the derived equation is regarded as the expression of a law
of nature, and the deviations of the observed values from those given
by it are interpreted as fluctuations arising from experimental error or
secondary perturbations. This standpoint is common in physics, in which
data often lie very closely about a smooth curve.

The method of least squares 

15.7 Let us suppose that we have n pairs of values $X_1 Y_1$, ..., $X_n Y_n$,
and that we wish to represent them by an equation of the type (15.2).
Our problem is, having fixed the value of p, to determine the constants
$a_0, a_1, \ldots, a_p$ in terms of the observed values X, Y, so as to get the best
possible fit.

The expression "best possible fit" may be defined in more than one
way, and consequently there is no unique method of determining the
constants. Several methods have been proposed, and our choice between
them is determined mainly by convenience. One way, which is suggested
by the geometrical representation, is to choose the curve of equation
(15.2) so that the sum of the distances (taken as positive) of the points
from it is a minimum, the sum of the distances being regarded as a measure
of goodness of fit, and the best fit being given by the curve of specified
degree for which that sum is least. But this method, whatever its
theoretical attractions, suffers from the disadvantage that it is difficult
to apply in practice except for the straight line.

An alternative method, which is in almost universal use at the present
time, is that known as the Method of Least Squares, and we proceed to
discuss it at length. We have already used it to find regression lines
(9.20 and 12.4).

15.8 If we substitute $X_r$ for $x$ in equation (15.2) we get a quantity
$y_r$ given by

\[ y_r = a_0 + a_1 X_r + a_2 X_r^2 + \ldots + a_p X_r^p \qquad (15.3) \]

This is not in general the same as $Y_r$, and we therefore define the residual
$\xi_r$ as

\[ \xi_r = Y_r - y_r = Y_r - a_0 - a_1 X_r - \ldots - a_p X_r^p \qquad (15.4) \]

There will be n residuals, one for each pair X, Y, and they are all zero
if, and only if, the curve is a perfect fit. We then take the sum of the
squares of residuals:

\[ U = \Sigma(\xi^2) = \Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)^2 \qquad (15.5) \]

If U is zero, each residual must be zero, and the data are represented
perfectly by the equation. Except in this case, U is positive. The
further the points lie from the curve of equation (15.2), the greater U
will be. U therefore provides one measure of the closeness of fit. From
this standpoint, the best fit will be that for which U is least.

The Method of Least Squares adopts this criterion, and states that 
the constants a shall be determined so that U is a minimum. 
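
The criterion can be tried numerically. The following is a minimal sketch in Python, not part of the original text: it evaluates U of equation (15.5) for the data of Table 15.1 (velocities in thousands of kilometres per second) along the line obtained in Example 15.1 and along a second line whose constants are purely hypothetical, chosen only for comparison.

```python
# Table 15.1: distance X (millions of parsecs), velocity Y (thousands of km per second).
X = [1.20, 1.82, 3.31, 7.24, 6.92, 9.12, 10.97, 14.45, 22.91, 36.31]
Y = [0.63, 0.89, 2.35, 3.81, 4.63, 4.82, 5.23, 7.50, 11.80, 19.60]

def sum_sq_residuals(coeffs, xs, ys):
    """U of equation (15.5) for the polynomial a_0 + a_1 x + ... + a_p x^p."""
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(a * x ** k for k, a in enumerate(coeffs))
        total += (y - fitted) ** 2   # square of the residual for this pair
    return total

u_fitted = sum_sq_residuals([0.109, 0.527], X, Y)  # line (a) of Example 15.1
u_rough = sum_sq_residuals([0.0, 0.5], X, Y)       # hypothetical line, for comparison only
print(u_fitted, u_rough)
```

On these figures the least-squares line should give the smaller U, as the criterion requires.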

15.9 The reason for taking the sum of squares of residuals, rather than 
the sum of residuals simply, is akin to that which led us to prefer the
standard deviation to the mean deviation as a measure of dispersion 
(Chap. 6), namely, that the former is more convenient in theory and leads 
to equations which are easier to handle in practice. 

15.10 It was formerly the custom, and is so still in works on the theory 
of observations, to derive the method of least squares from certain 
theoretical considerations, the assumed normality of the distribution of 
errors of observations being one such. It is, however, more than doubtful 
whether the conditions for the theoretical validity of the method are 
realised in statistical practice, and the student would do well to regard 
the method as recommended chiefly by its comparative simplicity and by 
the fact that it has stood the test of experience. 

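15.11 Consider now the quantity U, given by equation (15.5).
$a_0, a_1, \ldots, a_p$ are to be chosen so that this is a minimum, say $U_0$. Let us
imagine this done.

If, now, we substitute in equation (15.5) $a_0+\epsilon_0$ for $a_0$, $a_1+\epsilon_1$ for
$a_1$, and so on, we shall get a quantity $U_1$ given by

\[ U_1 = \Sigma\{Y - (a_0+\epsilon_0) - (a_1+\epsilon_1)X - \ldots - (a_p+\epsilon_p)X^p\}^2 \]

and $U_1$ is not less than $U_0$ for all values of $\epsilon_0, \epsilon_1, \ldots, \epsilon_p$.

Now,

\[ U_1 = \Sigma\{(Y - a_0 - a_1 X - \ldots - a_p X^p) - (\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)\}^2 \]
\[ = \Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)^2 \]
\[ \quad - 2\Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) \]
\[ \quad + \Sigma(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)^2 \]

The first of these terms is equal to $U_0$. Hence, if $U_1 \geq U_0$, we must have

\[ -2\Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) + \Sigma(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)^2 \geq 0 \qquad (15.6) \]

This is to be true for all values of $\epsilon_0, \ldots, \epsilon_p$. Let us then take these
quantities to be very small. The second term in equation (15.6), depending
as it does on the squares of the $\epsilon$'s, will be small compared with the first,
and may be neglected. (15.6) will then be true only if the first term
vanishes, for otherwise the $\epsilon$'s could be so chosen in sign as to make the
first term negative.

Hence,

\[ \Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) = 0 \qquad (15.7) \]

This is true for all small values of the $\epsilon$'s. Hence the coefficients of
$\epsilon_0, \epsilon_1, \ldots, \epsilon_p$ all vanish, i.e. we have:

\[ \Sigma(Y) - a_0 n - a_1\Sigma(X) - \ldots - a_p\Sigma(X^p) = 0 \]
\[ \Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) - \ldots - a_p\Sigma(X^{p+1}) = 0 \]
\[ \Sigma(YX^2) - a_0\Sigma(X^2) - a_1\Sigma(X^3) - \ldots - a_p\Sigma(X^{p+2}) = 0 \qquad (15.8) \]
\[ \ldots \]
\[ \Sigma(YX^p) - a_0\Sigma(X^p) - a_1\Sigma(X^{p+1}) - \ldots - a_p\Sigma(X^{2p}) = 0 \]

The equations (15.8) give us (p+1) equations in the (p+1) unknowns
$a_0, \ldots, a_p$. Hence they may be solved so as to give the a's in terms of
the calculable quantities $\Sigma(X)$, $\Sigma(X^2)$, ..., $\Sigma(X^{2p})$, $\Sigma(Y)$, $\Sigma(YX)$, ...,
$\Sigma(YX^p)$.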
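
As an illustration of how equations (15.8) may be set up and solved mechanically, here is a short Python sketch. It is an assumption-laden aid rather than the procedure of the text: the function name `fit_polynomial` is hypothetical, the sums are accumulated as exact fractions, and the linear equations are solved by ordinary Gauss-Jordan elimination. The trial data are invented points lying exactly on a known quadratic.

```python
from fractions import Fraction

def fit_polynomial(xs, ys, p):
    """Solve the p+1 equations (15.8) exactly; returns [a_0, ..., a_p] as floats."""
    xs = [Fraction(str(x)) for x in xs]
    ys = [Fraction(str(y)) for y in ys]
    # Power sums S[k] = sum of X^k (k = 0..2p) and product sums T[k] = sum of Y X^k (k = 0..p).
    S = [sum(x ** k for x in xs) for k in range(2 * p + 1)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(p + 1)]
    # Row i of the augmented matrix: a_0 S[i] + a_1 S[i+1] + ... + a_p S[i+p] = T[i].
    rows = [[S[i + j] for j in range(p + 1)] + [T[i]] for i in range(p + 1)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(p + 1):
        pivot = next(r for r in range(col, p + 1) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(p + 1):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [float(row[-1]) for row in rows]

# Invented check: points lying exactly on y = 1 + 2x + 3x^2 should be recovered.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]
print(fit_polynomial(xs, ys, 2))   # → [1.0, 2.0, 3.0]
```

Exact fractions are used only to keep the large power sums from losing figures; a desk calculation would carry enough decimal places instead.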

15.12 It will be seen that the solution of these equations depends on 
the evaluation of the various summed quantities. A first step is therefore 
to calculate these sums, and this is done by a process very similar to that 
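used in finding the moments of a distribution.

We can, in fact, express the equations in terms of moments. Dividing
each equation by n, and remembering that $\mu'_k = \frac{1}{n}\Sigma(X^k)$, we have:

\[ \frac{1}{n}\Sigma(Y) - a_0 - a_1\mu'_1 - \ldots - a_p\mu'_p = 0 \]
\[ \frac{1}{n}\Sigma(YX) - a_0\mu'_1 - a_1\mu'_2 - \ldots - a_p\mu'_{p+1} = 0 \qquad (15.9) \]
\[ \ldots \]
\[ \frac{1}{n}\Sigma(YX^p) - a_0\mu'_p - a_1\mu'_{p+1} - \ldots - a_p\mu'_{2p} = 0 \]
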
Equations for fitting a straight line 

15.13 In the simplest case, that of a straight line, we have p = 1, and
the equations (15.9) become:

\[ \frac{1}{n}\Sigma(Y) = a_0 + a_1\mu'_1 \]
\[ \frac{1}{n}\Sigma(YX) = a_0\mu'_1 + a_1\mu'_2 \qquad (15.10) \]

In particular, if X and Y are measured about their means and hence
are denoted by x, y, we have:

\[ \mu_1 = 0, \qquad \Sigma(y) = 0 \]

and hence, from (15.10),

\[ a_0 = 0, \qquad a_1 = \frac{\Sigma(yx)}{n\mu_2} \]

so that the fitted line is

\[ y = x\,\frac{\Sigma(yx)}{n\mu_2} \qquad (15.11) \]

i.e. passes through the mean of X and Y. This is, in fact, the first regression
equation of (9.6) (p. 216) in another form.
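
Equation (15.11) can be checked on the data of Table 15.1. The sketch below (Python, offered as an illustration only, with velocities in thousands of kilometres per second) measures both variables about their means, so that the slope is simply the sum of products divided by the sum of squares of x, the latter being n times the second moment; the intercept in the original units then follows because the line passes through the mean.

```python
# Table 15.1: distance X (millions of parsecs), velocity Y (thousands of km per second).
X = [1.20, 1.82, 3.31, 7.24, 6.92, 9.12, 10.97, 14.45, 22.91, 36.31]
Y = [0.63, 0.89, 2.35, 3.81, 4.63, 4.82, 5.23, 7.50, 11.80, 19.60]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Deviations from the means (the small letters x, y of the text).
dx = [v - mean_x for v in X]
dy = [v - mean_y for v in Y]

# Equation (15.11): slope = sum(yx) / (n * mu_2), and n * mu_2 = sum(x^2).
slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

# The line passes through the mean, which fixes the constant term in the original units.
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))
```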

15.14 In equation (15.2) it is customary to call x the "independent"
variable and y the "dependent" variable. In any given case it is, as a
rule, possible to regard either of the variables under consideration as the
independent variable, and the other as the dependent variable. We shall
then get two expressions, one giving variable A in terms of variable B, the
other giving B in terms of A; and there will be two curves of closest fit,
just as there are two regression lines in the theory of correlation.
These two curves are not, in general, the same, and the result sounds a
little paradoxical until we examine how the two curves are derived. We
have, in fact, two definitions of closest fit, one minimising residuals of the
type $(A - a_0 - a_1 B - \ldots)^2$, the other minimising residuals of the type
$(B - a_0' - a_1' A - \ldots)^2$. On a priori grounds there is nothing to choose
between the two.

15.15 Which of the two forms we choose will depend in practice on
a variety of circumstances. Sometimes one variable is clearly marked out
as the independent variable. For example, in considering the way in
which a population varies with time, it is almost inevitable to regard the
former as dependent on the latter, and not vice versa. In other cases the
choice is dictated by the purpose in view. For instance, in expressing the
relationship between current and resistance in an electric circuit, an
investigator would probably take as the independent variable that factor
over which he had direct control. Frequently, however, there is no guide
of this kind, and it may be necessary to ascertain both curves. See 15.27
below.

Calculation 

15.16 The calculations necessary to fit a curve by the method of least
squares fall into two stages. First of all, the sums of powers and products
which appear in equation (15.8) must be found, or, what amounts to the same
thing, the moments. To fit a curve of degree p it is necessary to find 2p
sums of the type $\Sigma(X^k)$ and p+1 sums of the type $\Sigma(YX^k)$ (including $\Sigma(Y)$).
The work is best carried out systematically after the manner of Chapter 7,
and several devices considerably shorten the arithmetical labour.

(a) By a suitable choice of origin and unit we can often reduce the
given values of X and Y to smaller numbers, a great help in calculating
the higher powers and sums. For instance, if the values of Y were 625,
650, 675, 700, we could take an origin at Y = 625, and a scale of one unit
= 25, and our new values would then be 0, 1, 2, 3.

(b) If the values of the independent variable proceed by equal steps,
and particularly if there is an odd number of them, the labour of
calculation is enormously reduced. We shall consider this important case in
some detail below (15.22).

When the various sums have been ascertained, the second stage, that 
of the solution of the equations (15.8), may be carried through. For a 
curve of degree p there are p+1 of these equations. They are linear in 
the unknowns a, and their solution offers only arithmetical difficulty. 

15.17 Before proceeding to consider some examples, we may remark
on one point of theoretical interest. It is always possible to fit a curve
of degree p exactly to p + 1 points; for instance, a straight line can be
drawn to pass exactly through two points, a cubic parabola through four
points, and so on. Thus, if we have n points we can always find a curve
of degree n - 1 which is an exact fit. But in practice n is rarely less than
ten, and a fitted curve of degree as high as this would have no practical
value and very little theoretical interest. It is only exceptionally that use
is found for fitted curves of degree higher than the fourth.

We will now consider some examples. 

Example 15.1. — Let us fit a straight line to the data of Table 15.1. To 
illustrate the method we will deal with both cases, taking first distance and 
then velocity as the independent variable. 

Denoting, then, distance by x and velocity by y, we wish to fit a curve
of the form

\[ y = a_0 + a_1 x \]

For this we require $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$ and $\Sigma(YX)$. For the alternative
case we shall also require $\Sigma(Y^2)$.

The arithmetic is shown in Table 15.2. In successive columns we write,
for each nebula, Y, X, $X^2$, YX and $Y^2$. Totals are shown at the foot of
the columns.


TABLE 15.2. Practical work for fitting a straight line to the data of Table 15.1

                       Mean velocity     Distance
Constellation          (000 km. per      (millions of      X^2          YX           Y^2
                       second)           parsecs)
                            Y                 X

Isolated Nebula II         0.63              1.20          1.4400       0.7560       0.3969
Virgo                      0.89              1.82          3.3124       1.6198       0.7921
Isolated Nebula I          2.35              3.31         10.9561       7.7785       5.5225
Pegasus                    3.81              7.24         52.4176      27.5844      14.5161
Pisces                     4.63              6.92         47.8864      32.0396      21.4369
Cancer                     4.82              9.12         83.1744      43.9584      23.2324
Perseus                    5.23             10.97        120.3409      57.3731      27.3529
Coma                       7.50             14.45        208.8025     108.3750      56.2500
Ursa Major                11.80             22.91        524.8681     270.3380     139.2400
Leo                       19.60             36.31       1318.4161     711.6760     384.1600

Total                     61.26            114.25       2371.6145    1261.4988     672.8998


Equations (15.8) then become

\[ \Sigma(Y) - n a_0 - a_1\Sigma(X) = 0 \]
\[ \Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) = 0 \]

or

\[ 61.26 - 10a_0 - 114.25a_1 = 0 \]
\[ 1261.4988 - 114.25a_0 - 2371.6145a_1 = 0 \]

Multiplying the first of these by 114.25 and the second by 10, and
subtracting, we get

\[ 5616.033 - 10,663.0825a_1 = 0 \]

so that

\[ a_1 = 0.527 \]

(more accurately, 0.526,680,066)

and hence,

\[ a_0 = 0.109 \]

(more accurately, 0.108,680,240)

So that

\[ y = 0.109 + 0.527x \qquad (a) \]

This line is shown in fig. 15.1.

If we wish to express distance in terms of velocity, we have,
interchanging X and Y in equations (15.8):

\[ x = a_0' + a_1' y \]
\[ \Sigma(X) - a_0' n - a_1'\Sigma(Y) = 0 \]
\[ \Sigma(XY) - a_0'\Sigma(Y) - a_1'\Sigma(Y^2) = 0 \]

or

\[ 114.25 - 10a_0' - 61.26a_1' = 0 \]
\[ 1261.4988 - 61.26a_0' - 672.8998a_1' = 0 \]

whence

\[ a_0' = -0.135, \qquad a_1' = 1.89 \]

and

\[ x = -0.135 + 1.89y \qquad (b) \]

Equations (a) and (b) are nearly identical, for dividing (a) by 0.527
and rearranging, we have:

\[ x = -0.207 + 1.90y \]

This is exceptional, and results from the closeness with which the points
lie to a straight line. The correlation between X and Y is, in fact, 0.997.
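
The arithmetic of Example 15.1 can be retraced from the totals of Table 15.2 alone. The following Python sketch is an illustration only; the helper name `solve_line` is hypothetical. It solves the pair of equations (15.8) for p = 1 by the elimination described above, once with distance and once with velocity as the independent variable, and then obtains the correlation as the square root of the product of the two slopes, both of which are positive here.

```python
# Totals from Table 15.2 (n = 10 nebulae).
n = 10
sum_x, sum_y = 114.25, 61.26
sum_xx, sum_xy, sum_yy = 2371.6145, 1261.4988, 672.8998

def solve_line(n, s_ind, s_dep, s_ind_sq, s_prod):
    """Solve the two equations (15.8) with p = 1; returns (constant term, slope)."""
    slope = (n * s_prod - s_ind * s_dep) / (n * s_ind_sq - s_ind ** 2)
    constant = (s_dep - slope * s_ind) / n
    return constant, slope

a0, a1 = solve_line(n, sum_x, sum_y, sum_xx, sum_xy)   # velocity in terms of distance
b0, b1 = solve_line(n, sum_y, sum_x, sum_yy, sum_xy)   # distance in terms of velocity

# The product of the two slopes is the square of the correlation coefficient.
r = (a1 * b1) ** 0.5
print(round(a0, 3), round(a1, 3), round(b0, 3), round(b1, 2), round(r, 3))
```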

Reduction of data to linear form

15.18 Example 15.2. It sometimes happens that we may reduce data
to a linear form by some simple transformation. Table 15.3, for
example, shows the number of fronds of a duckweed plant on fourteen
successive days. The number of fronds (N) clearly does not increase
uniformly with time (x), and the curve of growth is not linear, as may be
seen by graphing N against x. There are theoretical reasons for inquiring
whether the law of growth may be represented by an equation of the form

\[ N = a e^{bx} \]

A population which conformed to this equation would have the property
that its rate of increase at any moment was proportional to the size of
the population at that moment: its "birth-rate," so to speak, would be a
constant.

Taking logarithms, we have:

\[ \log_e N = \log_e a + bx \]

and if we now write $y = \log_e N$, we have:

\[ y = \log_e a + bx \]

which is linear in x and y.

We should, of course, have a relation of the same form, with different
values of the constants a and b, if we took logarithms to base 10, which
is usually the more convenient procedure.

We therefore try the effect of fitting a straight line to x (the time) and
$\log_{10} N$ (log number of fronds). From fig. 15.2 it will be seen that the
fit is a close one.

Fig. 15.2. Straight line fitted to data of Table 15.3. (Growth of duckweed) (Horizontal axis: days.)

The preliminary work is shown in Table 15.3. We find first Y,
corresponding to $\log_{10} N$, then $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(YX)$. For this particular
example we do not require $\Sigma(Y^2)$. In view of the simple character of
the values of X there is little saving in taking other origins or units for
X and Y, although, if we were fitting a curve of higher order, it might
be an advantage to take a different origin for X.
TABLE 15.3. Growth of duckweed

(V. H. Blackman, Nature, 6th June, 1936, quoting data of Ashby and Oxley)

Number of fronds     Y = log10 N      Days      X^2         YX
       N                               X

      100            2.0000000          1          1       2.0000000
      127            2.1038037          2          4       4.2076074
      171            2.2329961          3          9       6.6989883
      233            2.3673559          4         16       9.4694236
      323            2.5092025          5         25      12.5460125
      452            2.6551384          6         36      15.9308304
      654            2.8155777          7         49      19.7090439
      918            2.9628427          8         64      23.7027416
    1,406            3.1479853          9         81      28.3318677
    2,150            3.3324385         10        100      33.3243850
    2,800            3.4471580         11        121      37.9187380
    4,140            3.6170003         12        144      43.4040036
    5,760            3.7604225         13        169      48.8854925
    8,250            3.9164539         14        196      54.8303546

    Total           40.8683755        105       1015     340.9594891


Equations (15.8) then become:

\[ \Sigma(Y) - n a_0 - a_1\Sigma(X) = 0 \]
\[ \Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) = 0 \]

or

\[ 40.8683755 - 14a_0 - 105a_1 = 0 \]
\[ 340.9594891 - 105a_0 - 1015a_1 = 0 \]

whence

\[ a_0 = 1.785, \qquad a_1 = 0.1514 \]

and

\[ y = 1.785 + 0.1514x \qquad (a) \]

Raising 10 to this power, and remembering that $10^y = N$, we have:

\[ N = 10^{1.785} \times 10^{0.1514x} \qquad (b) \]

which we may also write, expressing the powers of 10 as actual numbers:

\[ N = 60.95 \times (1.417)^x \]
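
The reduction to linear form is easily retraced by machine. The Python sketch below is illustrative only: it takes the frond counts of Table 15.3, forms the logarithms to base 10, solves the two equations (15.8), and converts back to the exponential form. The variable names are hypothetical. On the equations as printed, a direct solution gives a constant term of roughly 1.784 rather than the 1.785 quoted, so that the leading factor comes out nearer 60.8; the slope, and hence the daily growth factor 1.417, agree with the text.

```python
import math

# Table 15.3: number of fronds on days 1 to 14.
fronds = [100, 127, 171, 233, 323, 452, 654, 918,
          1406, 2150, 2800, 4140, 5760, 8250]
days = list(range(1, 15))

# Transform: Y = log10(N) is to be fitted by a straight line in X (days).
logs = [math.log10(v) for v in fronds]

n = len(days)
sum_x, sum_y = sum(days), sum(logs)
sum_xx = sum(d * d for d in days)
sum_xy = sum(d * y for d, y in zip(days, logs))

# Solve the two equations (15.8) for the straight line.
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = (sum_y - a1 * sum_x) / n

# Back to the original scale: N = multiplier * growth_factor ** x.
multiplier = 10 ** a0
growth_factor = 10 ** a1
print(round(a0, 4), round(a1, 4), round(multiplier, 2), round(growth_factor, 3))
```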

15.19 Example 15.3. The process of taking logarithms may be applied
to both variables. In Table 15.4 are given the costs per unit of electricity
sold ($\eta$) and the number of units sold per head of the population served
by the undertaking ($\xi$) for 27 electricity undertakings. The data were
taken from the Returns of the Electricity Commission for 1933-34, which
cover about six hundred undertakings, by selecting every twenty-fifth.
They are, therefore, only a comparatively small sample, but they reflect
fairly accurately the general relationship between $\xi$ and $\eta$ for the whole
number of undertakings.

This relationship is illustrated by fig. 15.3, on which $\xi$ is graphed against
$\eta$. It will be seen that, broadly, the larger the number of units sold per
head, the lower the cost per unit.

The points of fig. 15.3 lie, in fact, about a curve which suggests a
relation of the form:

\[ \eta = a\,\xi^{-b} \]

As $\xi$ becomes larger, $\eta$ becomes smaller, and as $\xi$ tends to zero, $\eta$ tends to
infinity. Let us try to fit a curve of this kind to the data.

We have:

\[ \log \eta = \log a - b \log \xi \]

and, putting

\[ y = \log \eta, \qquad x = \log \xi \]

\[ y = \log a - bx \]

which is linear. We therefore proceed to fit a straight line to $\log \eta$ and $\log \xi$.
The preliminary work is shown in Table 15.4. Equations (15.8) become,
in the usual way,

\[ 5.2493 - 27a_0 - 50.1311a_1 = 0 \]
\[ 7.3008 - 50.1311a_0 - 97.1450a_1 = 0 \]

whence

\[ a_0 = 1.31, \qquad a_1 = -0.601 \]

and

\[ y = 1.31 - 0.601x \qquad (a) \]

From which

\[ \log \eta = 1.31 - 0.601 \log \xi \qquad (b) \]

or

\[ \eta = 20.42\,\xi^{-0.601} \]
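
A brief Python sketch, added here as an illustration and not taken from the source, solves the two equations just given and converts the straight line in the logarithms back into the power law. The function `cost_per_unit` is a hypothetical helper. Carrying more figures than the rounded value 1.31 gives a multiplier slightly above the 20.42 of the text.

```python
# Sums entering the equations of Example 15.3 (x = log of units sold, y = log of cost).
n = 27
sum_x, sum_y = 50.1311, 5.2493
sum_xy, sum_xx = 7.3008, 97.1450

# Solve the two equations (15.8) for the straight line y = a0 + a1 * x.
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = (sum_y - a1 * sum_x) / n

# Power law on the original scale: cost = multiplier * units ** a1.
multiplier = 10 ** a0

def cost_per_unit(units_per_head):
    """Fitted working cost (pence per unit) for a given number of units sold per head."""
    return multiplier * units_per_head ** a1

print(round(a0, 3), round(a1, 3), round(multiplier, 2))
print(round(cost_per_unit(100.0), 2))
```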


Fig. 15.4 shows the values of y plotted against those of x. The straight
line we have found cannot be described as a good fit, but so far as the eye
can judge it is as good as any simple curve is likely to be. It expresses
the general relation between x and y; but, naturally, local circumstances
cause individual values to deviate appreciably from this relation.
Statistical data which are not produced under laboratory conditions are very
often of this nature. The fitted curve expresses a general trend, but
individual cases may lie well away from it in a number of instances.

Fitting of more general curves 

15.20 Example 15.4. We must now consider the fitting of curves of order
higher than the first.

Table 15.5 on p. 356 shows the percentage loss of weight (Y) for certain
temperatures (X) in experiments on the oven-drying of soils. Since X is
here the controllable factor, it is natural to take it as the independent
variable, and we shall express Y in terms of X.

The data are shown graphically in fig. 15.5. We shall find successively
the straight line, quadratic parabola and cubic parabola of closest fit. We
shall therefore require sums of powers of X up to $\Sigma(X^6)$ and sums of
products up to $\Sigma(YX^3)$. We also require, for later work, $\Sigma(Y^2)$.

The preliminary work is shown in Table 15.5. We might, perhaps,
have abbreviated the arithmetic slightly by taking an origin of x at
X = 100 and of y at Y = 3, but the saving would not have been large.
Data of this kind frequently give rise to large figures in the higher sums,
and a machine is a great help in the calculation. For instance, with a
machine the sums $\Sigma(YX)$, etc., can be found by continuous addition,
without the necessity for writing each individual contribution in the
relative column.

For the straight line of closest fit, equations (15.8) become:

\[ 82.97 - 16a_0 - 2642a_1 = 0 \]
\[ 14,736.19 - 2642a_0 - 474,050a_1 = 0 \]

whence

\[ a_0 = 0.660 \quad \text{and} \quad a_1 = 0.02741 \]

(more accurately, 0.659,759,789 and 0.027,408,722)
TABLE 15.4. Reduction of non-linear relation to linear form

Relationship between Working Costs per Unit and Number of Units Sold in 27 Electricity
Undertakings.

(Data from Return of Engineering and Financial Statistics, 1933-34, Electricity Commission.)

                                 Working costs   Units sold (excluding
Name of undertaking              per unit sold   bulk supplies) per       log eta     log xi       YX         X^2
                                 (pence)         head of population       = Y         = X
                                 eta             xi

Aberdare                           1.53             63.1                  0.18469     1.8000      0.3324     3.2400
Barry U.D.C.                       2.36             12.1                  0.37291     1.0828      0.4038     1.1725
Bredbury and Romiley               0.70            394.2                 -0.15490     2.5957     -0.4021     6.7377
Chesterfield                       0.56            220.5                 -0.25181     2.3434     -0.5901     5.4915
Earby                              1.41             52.4                  0.14922     1.7193      0.2566     2.9560
Grange                             1.88            119.4                  0.27416     2.0770      0.5694     4.3139
Holmfirth                          1.17            181.6                  0.06819     2.2591      0.1541     5.1035
Lincoln                            0.78            293.8                 -0.10791     2.4681     -0.2663     6.0915
Mexborough                         1.13            170.4                  0.05308     2.2315      0.1185     4.9796
Nuneaton                           0.86            184.1                 -0.06550     2.2651     -0.1484     5.1307
Redcar                             1.91             68.0                  0.28103     1.8325      0.5150     3.3581
Slaithwaite                        1.40             80.7                  0.14613     1.9069      0.2787     3.6363
Tanfield                           2.41             29.0                  0.38202     1.4624      0.5587     2.1386
West Lancs R.D.C.                  1.37             53.4                  0.13672     1.7275      0.2362     2.9843
Dumfries Corp.                     1.10             93.0                  0.04139     1.9685      0.0815     3.8750
Tobermory                          4.21             19.9                  0.62428     1.2989      0.8109     1.6871
Aberayron                          8.9              25.6                  0.94939     1.4082      1.3369     1.9830
Brixham Gas and Electric Co.       3.13             30.4                  0.49554     1.4829      0.7348     2.1990
Chudleigh Co.                      7.28             16.7                  0.86213     1.2227      1.0541     1.4950
Foots Cray Co.                     1.92             77.8                  0.28330     1.8910      0.5357     3.5759
Lewes Co.                          1.14            120.1                  0.05690     2.0795      0.1183     4.3243
Newcastle Electric Light Co.       0.64             68.8                 -0.19382     1.8376     -0.3562     3.3768
Ramsgate Co.                       1.57             60.5                  0.19590     1.7818      0.3490     3.1748
Steyning Co.                       1.06             93.9                  0.02531     1.9727      0.0499     3.8915
West Devon Co.                     1.98             22.1                  0.29667     1.3444      0.3988     1.8074
Coatbridge and Airdrie Co.         0.68            196.2                 -0.16749     2.2927     -0.3840     5.2565
Skelmorlie Co.                     2.05             60.1                  0.31175     1.7789      0.5546     3.1645

Total                                                                     5.2493     50.1311      7.3008    97.1450


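and the straight line is:

\[ y = 0.660 + 0.02741x \qquad (a) \]

For the quadratic parabola, equations (15.8) are:

\[ \Sigma(Y) - n a_0 - a_1\Sigma(X) - a_2\Sigma(X^2) = 0 \]
\[ \Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) - a_2\Sigma(X^3) = 0 \]
\[ \Sigma(YX^2) - a_0\Sigma(X^2) - a_1\Sigma(X^3) - a_2\Sigma(X^4) = 0 \]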
Fig. 15.4. Straight line fitted to logarithms of data of Table 15.4 (Horizontal axis: logarithm of number of units sold per head of population.)


These become, on substitution,

\[ 82.97 - 16a_0 - 2642a_1 - 474,050a_2 = 0 \]
\[ 14,736.19 - 2642a_0 - 474,050a_1 - 91,244,582a_2 = 0 \]
\[ 2,819,909.45 - 474,050a_0 - 91,244,582a_1 - 18,553,164,842a_2 = 0 \]

giving

\[ a_0 = 3.551, \qquad a_1 = -0.009291, \qquad a_2 = 0.00010695 \]

(more accurately, 3.550,990,2, -0.009,291,235,7, and 0.000,106,954,12)
and the parabola is:

\[ y = 3.551 - 0.009291x + 0.00010695x^2 \qquad (b) \]
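
With sums as large as these, ordinary rounding can easily spoil the solution, so a check in exact arithmetic is instructive. The Python sketch below is an illustration under stated assumptions: it takes the three equations exactly as quoted above, holds every figure as a rational number, and solves by Cramer's rule (a choice made here for brevity, not the method of the text). The helper `det3` is hypothetical.

```python
from fractions import Fraction as F

def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Sums quoted in Example 15.4 for the quadratic parabola (n = 16 observations).
S0, S1, S2 = F(16), F(2642), F(474050)
S3, S4 = F(91244582), F(18553164842)
T0, T1, T2 = F("82.97"), F("14736.19"), F("2819909.45")

A = [[S0, S1, S2],
     [S1, S2, S3],
     [S2, S3, S4]]
rhs = [T0, T1, T2]

D = det3(A)
coeffs = []
for k in range(3):
    # Cramer's rule: replace column k of A by the right-hand side.
    Ak = [[rhs[r] if c == k else A[r][c] for c in range(3)] for r in range(3)]
    coeffs.append(float(det3(Ak) / D))

print(coeffs)   # a_0, a_1, a_2 of the quadratic parabola
```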

For the cubic parabola, equations (15.8) are:

\[ \Sigma(Y) - n a_0 - a_1\Sigma(X) - a_2\Sigma(X^2) - a_3\Sigma(X^3) = 0 \]
\[ \Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) - a_2\Sigma(X^3) - a_3\Sigma(X^4) = 0 \]
\[ \Sigma(YX^2) - a_0\Sigma(X^2) - a_1\Sigma(X^3) - a_2\Sigma(X^4) - a_3\Sigma(X^5) = 0 \]
\[ \Sigma(YX^3) - a_0\Sigma(X^3) - a_1\Sigma(X^4) - a_2\Sigma(X^5) - a_3\Sigma(X^6) = 0 \]

which become —

It is not really necessary to write out the large numbers of the later
equations as fully as we have done, and a certain amount of approximation
is allowable. The student should, however, be careful not to introduce it
too soon, as neglected quantities may become of cumulative importance
in the solution of the equations.

By straightforward but rather strenuous arithmetic we find:

\[ a_0 = 7.783, \qquad a_1 = -0.08940 \]
\[ a_2 = 0.0005875, \qquad a_3 = -0.0000009189 \]

(more accurately,

\[ a_0 = 7.782,526,861, \qquad a_1 = -0.089,402,395,60 \]
\[ a_2 = 0.000,587,479,234,2, \qquad a_3 = -0.000,000,918,891,069,8) \]

The smallness of the coefficients $a_2$ and $a_3$ does not mean that they are
of minor importance, since in the equation for y they are multiplied by
terms in $x^2$ and $x^3$, which may be large.

The cubic parabola is, then,

\[ y = 7.783 - 0.08940x + 0.0005875x^2 - 0.0000009189x^3 \]

which we may also write as:

\[ y = 7.783 - 8.940\left(\frac{x}{100}\right) + 5.875\left(\frac{x}{100}\right)^2 - 0.9189\left(\frac{x}{100}\right)^3 \qquad (c) \]

Fig. 15.5 shows the data graphically, with the straight line and cubic
parabola of closest fit.




TABLE 15.5. Relationship between temperature and percentage loss in weight of certain soil samples

(Data from J. R. H. Coutts, "'Single Value' Soil Properties: V. On the Changes Produced in a Soil by Oven-drying," Journal of Agricultural Science, 1930, 20, 541.)
15.21 Although a graph will usually suggest whether a straight line
or quadratic parabola is likely to give a satisfactory fit, it will not as a rule
be much guide in deciding whether further terms will repay the labour
of calculation. This can be judged, at least roughly, by calculating
the terms given by the polynomial (to as high a degree as it has been
carried) for the observed values of x, and then observing the run of the
residuals. If the signs run more or less at random it will hardly be
worth while to calculate another term; but if a series of positive residuals
is followed by a series of negative residuals, these by another series of
positive residuals, etc., it will probably be worth while to proceed further.
Moreover, the coefficients for a parabola of order k are no guide to those
of order k+1. For instance, in Example 15.4, the values of $a_0$ for the
straight line, square parabola and cubic parabola are 0.660, 3.551, 7.783;
and those of $a_1$ are 0.02741, -0.009291, -0.08940. From this information
we could not guess even the sign of these coefficients in the parabola
of order 4, and if we wished to fit such a curve five equations of the type
(15.8) would have to be solved ab initio.

The student, therefore, should not fall into the error of thinking that
parabolas of successive orders will resemble each other in their lower
terms, or that the fitting of a curve of order k+1 is merely a question of
adding an extra term to a curve of order k. It would be a great
convenience if this were so, and, in fact, methods have been devised whereby
one variate can be expressed in terms of certain polynomials of the other
in such a way that this advantage is secured. The theory of these
so-called "orthogonal" polynomials is, however, outside the scope of
the present work.
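
The inspection of the run of residual signs described in 15.21 can be sketched as follows in Python. This is an illustration only: the function names are hypothetical, and the data are invented points lying on a parabola, to which a straight line has deliberately been fitted so that the systematic run of signs is obvious.

```python
def residual_signs(coeffs, xs, ys):
    """Signs of observed minus fitted values for the polynomial with the given coefficients."""
    signs = []
    for x, y in zip(xs, ys):
        fitted = sum(a * x ** k for k, a in enumerate(coeffs))
        signs.append("+" if y >= fitted else "-")
    return "".join(signs)

def count_runs(signs):
    """Number of unbroken runs of like signs."""
    return sum(1 for i, s in enumerate(signs) if i == 0 or s != signs[i - 1])

# Invented data on the parabola y = x^2, with x symmetrical about zero.
xs = list(range(-4, 5))
ys = [x * x for x in xs]

# For data symmetrical in x the least-squares line is horizontal, at the mean of y.
line = [sum(ys) / len(ys), 0.0]

signs = residual_signs(line, xs, ys)
print(signs, count_runs(signs))   # → ++-----++ 3
```

Nine residuals falling into only three runs is the pattern that suggests a further term is worth calculating; signs scattered at random would give many more runs.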

The case when the independent variable proceeds by equal steps 

15.22 When the independent variable x proceeds by steps of equal
amount h, the arithmetical solution of equations (15.8) can be greatly
simplified, particularly if the number of values is odd. In such a case
we take h as the unit of x and an origin at the middle term. The values
of X will then be -k, -(k-1), -(k-2), ..., -2, -1, 0, 1, 2, ...,
(k-2), (k-1), k, and owing to the symmetry of this series the sums of
odd powers of X will vanish, i.e. $\Sigma(X)$, $\Sigma(X^3)$, $\Sigma(X^5)$, etc., are all zero.
Equations (15.8) then become, taking p as odd,

\[ \Sigma(Y) - n a_0 - a_2\Sigma(X^2) - a_4\Sigma(X^4) - \ldots = 0 \]
\[ \Sigma(YX) - a_1\Sigma(X^2) - a_3\Sigma(X^4) - \ldots = 0 \]
\[ \ldots \qquad (15.12) \]
\[ \Sigma(YX^{p-1}) - a_0\Sigma(X^{p-1}) - a_2\Sigma(X^{p+1}) - \ldots = 0 \]
\[ \Sigma(YX^p) - a_1\Sigma(X^{p+1}) - a_3\Sigma(X^{p+3}) - \ldots = 0 \]

and not only is the number of terms reduced, but the equations split
into two sets, one in $a_0$, $a_2$, $a_4$, etc., and the other in $a_1$, $a_3$, etc.
Moreover, the sums of even powers of X are twice the sums of powers of the
first k natural numbers, which may be easily found, either from tables
or from known formulae.
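
A short Python check of these statements, offered as an illustration, for thirteen equally spaced values (k = 6), which is the case arising in Example 15.5. The function name is hypothetical.

```python
def even_power_sum(k, power):
    """Sum of X^power over X = -k, ..., k for an even power: twice the sum over 1, ..., k."""
    return 2 * sum(i ** power for i in range(1, k + 1))

k = 6                          # thirteen values: -6, -5, ..., 5, 6
xs = range(-k, k + 1)

even_sums = [even_power_sum(k, q) for q in (2, 4, 6)]
odd_sums = [sum(x ** q for x in xs) for q in (1, 3, 5)]

print(even_sums)   # → [182, 4550, 134342]
print(odd_sums)    # → [0, 0, 0]

# The known formula for the sum of squares of the first k natural numbers.
assert even_sums[0] == 2 * k * (k + 1) * (2 * k + 1) // 6
```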

Example 15.5. Table 15.6 shows the population of England and
Wales in certain census years from 1811 onwards. Taking the time as
the independent variable, we choose as the unit of X the period of ten years,
and the origin at the mid-point of the range, 1871. The preliminary work
for the fitting of curves up to the cubic form is shown in the table.

For the cubic parabola, equations (15.8) are, then,

$$
\begin{aligned}
314.09 - 13a_0 - 182a_2 &= 0 \\
474.77 - 182a_1 - 4{,}550a_3 &= 0 \\
4{,}520.45 - 182a_0 - 4{,}550a_2 &= 0 \\
11{,}632.97 - 4{,}550a_1 - 134{,}342a_3 &= 0
\end{aligned}
$$

whence

$$a_0 = 23.299, \quad a_1 = 2.895, \quad a_2 = 0.06153, \quad a_3 = -0.01147$$

The parabola is, therefore,

$$Y = 23.299 + 2.895X + 0.06153X^2 - 0.01147X^3 \qquad (a)$$

Fig. 15.6 shows the data graphically, together with this cubic. 
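
The arithmetic of this example can be checked in a few lines. The sketch below is in Python; the helper names `s` and `solve2` are ours and are not part of the text. It uses the population figures of Table 15.6 (the 1931 value, 39.95, is the one implied by the entry YX = 239.70 at X = 6) and solves the two separate pairs of equations into which the cubic case splits.

```python
# Population of England and Wales (millions), census years 1811-1931.
Y = [10.16, 12.00, 13.90, 15.91, 17.93, 20.07, 22.71,
     25.97, 29.00, 32.53, 36.07, 37.89, 39.95]
k = (len(Y) - 1) // 2
X = list(range(-k, k + 1))          # unit = 10 years, origin at 1871


def s(power, with_y=False):
    """Sum of X**power, optionally with each term multiplied by Y."""
    return sum((y if with_y else 1) * x**power for x, y in zip(X, Y))


def solve2(a11, a12, a21, a22, b1, b2):
    """Solve a pair of linear equations by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Even set:  n*a0 + S(X^2)*a2 = S(Y);   S(X^2)*a0 + S(X^4)*a2 = S(YX^2)
a0, a2 = solve2(len(Y), s(2), s(2), s(4), s(0, True), s(2, True))
# Odd set:   S(X^2)*a1 + S(X^4)*a3 = S(YX);  S(X^4)*a1 + S(X^6)*a3 = S(YX^3)
a1, a3 = solve2(s(2), s(4), s(4), s(6), s(1, True), s(3, True))

print(round(a0, 3), round(a1, 3), round(a2, 5), round(a3, 5))
```

Because the sums of odd powers vanish, two pairs of equations replace one set of four, which is the saving described in 15.22.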

Incidentally, this example illustrates one point of some importance.
Over the years 1811 to 1931 the cubic gives a fair fit, and might be used
to estimate the population at intermediate years. But for extrapolation
it is of very little value. We could not estimate the population for 1961
with any confidence by putting X = 9 in the cubic; still less that for later
years. Unless there are good reasons for supposing that the fitted curve
is an accurate representation of a theoretical relationship, it is dangerous
to assume that a fitted parabola can be used outside the range for which
it was ascertained.

It would be instructive for the student to fit merely a segment of some
actual series and note how rapidly the curve calculated from the segment
diverged from the observations outside its limits. It has been shown that
even within the limits of the fitted observations the fit tends to be worst
as the limits are approached. The higher powers of X become of greater
and greater effect the more we diverge from the centre of the fitted
segment and tend, so to speak, to "wag the tail" of the curve.
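
As a rough numerical illustration of this warning, the fitted cubic (a) can be evaluated beyond the last census year. The sketch below uses only the coefficients quoted in the example; the function name `cubic` and the choice of years are ours, and nothing here is a statement about the actual population.

```python
# Coefficients of the fitted cubic (a); X is measured in decades from 1871.
A0, A1, A2, A3 = 23.299, 2.895, 0.06153, -0.01147


def cubic(x):
    """Value of the fitted cubic (millions) at X = x."""
    return A0 + A1 * x + A2 * x**2 + A3 * x**3


for year in (1931, 1961, 1991, 2021):
    x = (year - 1871) / 10
    print(year, round(cubic(x), 2))
```

Because the coefficient of the cube is negative, the curve reaches a maximum a little beyond X = 11 and thereafter falls without limit. That behaviour is imposed by the algebraic form of the curve, not suggested by the data.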

15.23 If the number of values of x is even, we have a choice of two
methods of procedure. We can take h as unit and the origin at one of
the two middle values; or we can take ½h as unit and origin midway
between the two central values. In the first case, the sums of odd powers
will no longer vanish, but they will nevertheless be easily calculable,
since all terms except a single outlying member in the summation will
cancel out in pairs. In the second case the sums of odd powers will
vanish, but the other sums will no longer be twice those of the first k
natural numbers, but of the first k odd numbers. In either case the solution
of the equations (15.8) is not difficult.

TABLE 15.6. Curve-fitting to growth of population in England and Wales
(Data from Registrar-General's Statistical Review of England and Wales, 1935, Tables, Part II.)

Year    Population     X   X^2    X^3     X^4      X^6       YX       YX^2        YX^3
        (millions)
            Y

1811      10.16       -6    36   -216   1,296   46,656   -60.96     365.76   -2,194.56
1821      12.00       -5    25   -125     625   15,625   -60.00     300.00   -1,500.00
1831      13.90       -4    16    -64     256    4,096   -55.60     222.40     -889.60
1841      15.91       -3     9    -27      81      729   -47.73     143.19     -429.57
1851      17.93       -2     4     -8      16       64   -35.86      71.72     -143.44
1861      20.07       -1     1     -1       1        1   -20.07      20.07      -20.07
1871      22.71        0     0      0       0        0     0          0           0
1881      25.97        1     1      1       1        1    25.97      25.97       25.97
1891      29.00        2     4      8      16       64    58.00     116.00      232.00
1901      32.53        3     9     27      81      729    97.59     292.77      878.31
1911      36.07        4    16     64     256    4,096   144.28     577.12    2,308.48
1921      37.89        5    25    125     625   15,625   189.45     947.25    4,736.25
1931      39.95        6    36    216   1,296   46,656   239.70   1,438.20    8,629.20

Total    314.09        0   182      0   4,550  134,342   474.77   4,520.45   11,632.97

Calculation of the sum of squares of residuals

15.24 The eye is not a reliable guide to the closeness with which a given
curve lies to data, and it is desirable to have some more accurate measure
of the closeness of fit. For this purpose we require to be able to find
the sum of the squares of residuals U. We know by our method of
ascertaining the curve that this will be less than the corresponding quantity
for any other curve of the same degree, and our interest is centred on how
close this is to the ideal value zero.

To calculate the sum of squares of residuals it is not necessary to
calculate each separate residual. In fact, for the parabola of order p we
have

$$
\begin{aligned}
U &= \Sigma(Y - a_0 - a_1X - a_2X^2 - \ldots - a_pX^p)^2 \\
  &= \Sigma\{Y(Y - a_0 - a_1X - \ldots - a_pX^p)\}
\end{aligned}
$$

for the terms of the type $\Sigma\{a_kX^k(Y - a_0 - a_1X - \ldots - a_pX^p)\}$
vanish in virtue of equations (15.8). Hence,

$$U = \Sigma(Y^2) - a_0\Sigma(Y) - a_1\Sigma(YX) - \ldots - a_p\Sigma(YX^p) \qquad (15.13)$$

The constants a and the sums which appear in this expression have
already been found, with the exception of $\Sigma(Y^2)$ in some cases. With
this additional quantity we can find U.
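
Formula (15.13) can be compared with a direct calculation from the separate residuals. The following Python sketch does so for a straight line fitted to the five points of Exercise 15.1, which are used here only as convenient figures; the variable names are ours.

```python
# Five points (those of Exercise 15.1), used only as an illustration.
X = [0, 1, 2, 3, 4]
Y = [1.0, 1.8, 1.3, 2.5, 6.3]
n = len(X)

sx, sxx = sum(X), sum(x * x for x in X)
sy, syy = sum(Y), sum(y * y for y in Y)
sxy = sum(x * y for x, y in zip(X, Y))

# Least-squares straight line Y = a0 + a1*X from the two normal equations.
a1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
a0 = (sy - a1 * sx) / n

# (15.13): U from sums already formed, without the individual residuals.
u_formula = syy - a0 * sy - a1 * sxy
# Direct calculation, residual by residual, for comparison.
u_direct = sum((y - a0 - a1 * x) ** 2 for x, y in zip(X, Y))

print(a0, a1, u_formula, u_direct)
```

The two values of U agree to within rounding. Since (15.13) gives U as a small difference between much larger terms, any rounding of the constants a is magnified in the result.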


Example 15.6. Let us find U for the data of Example 15.4 for the
straight line and the two parabolas.

For the line

$$U = \Sigma(Y^2) - a_0\Sigma(Y) - a_1\Sigma(YX)$$

Here

$$
\begin{aligned}
\Sigma(Y) &= 82.97, & \Sigma(YX) &= 14{,}736.19 \\
\Sigma(Y^2) &= 459.4363, & a_0 &= 0.659759789 \\
 & & a_1 &= 0.027408722
\end{aligned}
$$

Hence,

$$U = 459.4363 - 54.74027 - 403.90014 = 0.7959$$

For the quadratic parabola

$$U = \Sigma(Y^2) - a_0\Sigma(Y) - a_1\Sigma(YX) - a_2\Sigma(YX^2)$$

and here

$$
\begin{aligned}
a_0 &= 3.5509902 \\
a_1 &= -0.0092912357 \\
a_2 &= 0.00010695412
\end{aligned}
$$

whence

$$U = 0.1271$$

Similarly, for the cubic

$$U = 0.0485$$

The value of U therefore decreases from 0.7959 for the straight line to
0.0485 for the cubic. This is what we should expect, for the addition of
extra terms means that we have additional constants at our disposal in
the task of minimising U.

To obtain U with any accuracy by the foregoing method it is necessary
to ascertain the a's to a considerable number of decimal places.

Measurement of the closeness of fit 

15.25 The value of U enables us to make some sort of comparison
between the fits of different curves to the same data; but it is not, in itself,
a satisfactory measure of fit, since it does not permit of the comparison
of the fits of curves to different data. The measure U/n, which is the
variance of errors of estimation, suggests itself, but this, like U, is not
absolute, being dependent on the units in which we are working. For a
satisfactory measure some form of ratio would have to be taken.

Such a ratio arises in a natural way if we consider the correlation
between the actual values of Y and those "predicted" by the polynomial.

Let us, without loss of generality, suppose that the values are measured
from their mean, and let y be the value given by the polynomial and Y
be the actual value. Then, as in 15.24,

$$\Sigma(y^2) = \Sigma(Yy) \qquad (15.14)$$

$$U = \Sigma\{Y(Y - y)\} = \Sigma(Y^2) - \Sigma(Yy) \qquad (15.15)$$

Writing $\sigma_Y$, $\sigma_y$ for the standard deviations of Y and y, and R for the
correlation between them, we get, from (15.14),

$$\sigma_y^2 = R\sigma_Y\sigma_y$$

or

$$\sigma_y = R\sigma_Y \qquad (15.16)$$

and from (15.15),

$$\frac{U}{n} = \sigma_Y^2 - R\sigma_Y\sigma_y \qquad (15.17)$$

Hence, substituting for $\sigma_y$ from (15.16),

$$R^2 = 1 - \frac{U}{n\sigma_Y^2} \qquad (15.18)$$

which gives the correlation in terms of the ratio of U/n and the variance
$\sigma_Y^2$.

R is, in fact, analogous to the multiple correlation coefficient and the
correlation ratio, and the equation (15.18) should be compared with
equation (11.3), page 256, and the corresponding equation on page 286.

Example 15.7. In Example 15.1 we have, using the data of Table 15.2
and the constants found:

$$\sigma_Y^2 = 67.28998 - (6.126)^2 = 29.762104$$

$$U = 1.835777255$$

$$R^2 = 1 - \frac{1.835777255}{297.62104} = 0.99383183$$

$$R = 0.99691$$

For the soil data of Examples 15.4 and 15.6 we find:

For the straight line R = 0.98627
For the cubic R = 0.99917

Thus, judged by the value of R, the straight line of Example 15.1 is a
better fit than that of Example 15.4, but a worse fit than the cubic of the
latter.
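
The arithmetic of Example 15.7 may be retraced as follows. This is only a sketch of (15.18) in Python: the figures are those quoted in the example, the variable names are ours, and n = 10 is an inference from the denominator 297.62104, which is ten times the variance 29.762104.

```python
import math

mean_of_squares = 67.28998    # S(Y^2)/n, as quoted in Example 15.7
mean_y = 6.126                # mean of Y
U = 1.835777255               # sum of squares of residuals
n = 10                        # assumption: inferred from 297.62104 / 29.762104

var_y = mean_of_squares - mean_y**2      # variance of Y
r_squared = 1 - U / (n * var_y)          # equation (15.18)
R = math.sqrt(r_squared)

print(round(var_y, 6), round(r_squared, 8), round(R, 5))
```

Since U/n and the variance are in the same units, the ratio, and hence R, is unaffected by a change in the unit of Y.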

15.26 As a general comment on the scope of the methods of curve-
fitting described in this chapter, we may remark that although polynomials
can always be fitted to data, the student should not assume that even the
polynomial of closest fit will necessarily be a satisfactory fit. It may
exhibit peculiarities of behaviour which are entirely absent from the data
themselves. He may well ask, when confronted by a given set of data,
how he is to know whether they may be satisfactorily represented by a
polynomial. The answer is that he must fit one and see. Some further
remarks on this point are given later in 24.12, where similar questions
arise in connection with interpolation and graduation.

15.27 The reader must be mindful of the fact that in the type of curve-
fitting discussed above there is an essential difference between the roles
of the independent and the dependent variables, which accounts for
there being two curves according to which variable is regarded as
independent. If y is the dependent and x the independent variable the
minimisation of the sum of squares of residuals in the manner of 15.8 is
equivalent to supposing that if there is a "true" law under which y is
equal to a polynomial in x, the "errors" observed are in the dependent
variable y, not in x. Per contra, if we suppose that the errors are in x,
we must minimise the sum of squares of residuals in x, which makes the
latter the dependent variable.

15.28 Suppose, however, that x and y are known to be related by a linear
equation but that both variables are subject to error. What is then the
appropriate method of finding the best estimate of the unknown relation?
If the errors are small, as seems to be the case in Example 15.1, an
approximation is given by the methods we have used because the two lines of
closest fit are nearly identical. But where the errors may be large, and in
any case as a theoretical problem where both variates are subject to error,
we may require to find a unique relation most probably (in some sense)
representing the truth. This sort of problem may very well arise, for
example, in physics where it is assumed that there exists a definite
functional relationship between two quantities (the pressure and the reciprocal
of the volume of a gas or the length and temperature of a metal rod) both
of which are subject to errors of measurement.

This type of problem is extraordinarily difficult to solve and we 
have no space to discuss it here at any length. A single illustration of 
the complications which arise will have to suffice. 

A plausible procedure to determine a unique straight line fitting a set
of points on a scatter diagram is to minimise the sum of squares of
perpendiculars from the points on to the line. This is equivalent to finding
the principal axis (10.9) which, in a sense, may be regarded as "closest"
to the points. But unfortunately this line will vary according to the
scale of measurement of the variates: if we double the scale of one and
hence enlarge the scatter diagram by the factor 2 in one direction, the
new line has a different equation from the old and the difference is not
merely that the transformed variate is in the new scale. Geometrically,
we may say that right-angles are not preserved in a diagram if it is
stretched in one direction, so that perpendiculars from points to lines
do not remain perpendiculars under such a transformation. The procedure
we are considering, therefore, whatever its merits as providing empirically
a line of closest fit, is open to the theoretical objection that the answer
it gives depends on the scale of measurement, which in many problems
is repugnant to commonsense requirements. We do not, for example,
expect the linear law connecting the length of a rod with its temperature
to depend on whether we are measuring the latter in Centigrade, Fahrenheit
or absolute units. The procedure is reasonably plausible if both variables
are of the same kind, e.g. both temperatures, so that a change of scale
affects both to the same extent. The difficulties become intensified if
the underlying law is not linear.*
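
The dependence on scale is easily exhibited numerically. In the Python sketch below the five points are hypothetical and the function names are ours. The slope of the principal axis is taken as that of the leading characteristic vector of the matrix of sums of squares and products about the means. Each line is fitted twice, once to the original data and once after every x has been doubled, and the second slope is then re-expressed in the original units.

```python
import math


def moments(x, y):
    """Sums of squares and products about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy


def slope_regression(x, y):
    """Slope of the least-squares line of y on x (errors in y only)."""
    sxx, _, sxy = moments(x, y)
    return sxy / sxx


def slope_principal_axis(x, y):
    """Slope of the line minimising the sum of squared perpendiculars."""
    sxx, syy, sxy = moments(x, y)
    d = syy - sxx
    return (d + math.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical observations
y = [2.0, 1.0, 4.0, 3.0, 6.0]
x2 = [2 * v for v in x]              # the same data with every x doubled

# A slope fitted on the doubled scale is multiplied by 2 to bring it
# back to the original units before comparison.
print(slope_regression(x, y), 2 * slope_regression(x2, y))
print(slope_principal_axis(x, y), 2 * slope_principal_axis(x2, y))
```

The regression slope is the same on either scale, whereas the two principal-axis slopes differ, so that the line of least perpendiculars is not merely the old line written in new units.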

SUMMARY 

1. A parabola of the form $Y = a_0 + a_1X + a_2X^2 + \ldots + a_pX^p$ may be
fitted to data by choosing the constants a so that the sum of squares of
residuals $U = \Sigma(Y - a_0 - a_1X - \ldots - a_pX^p)^2$ is a minimum.

2. This method leads to the equations

$$
\begin{aligned}
\Sigma(Y) - na_0 - a_1\Sigma(X) - a_2\Sigma(X^2) - \ldots - a_p\Sigma(X^p) &= 0 \\
\Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) - a_2\Sigma(X^3) - \ldots - a_p\Sigma(X^{p+1}) &= 0 \\
&\;\;\vdots \\
\Sigma(YX^p) - a_0\Sigma(X^p) - a_1\Sigma(X^{p+1}) - a_2\Sigma(X^{p+2}) - \ldots - a_p\Sigma(X^{2p}) &= 0
\end{aligned}
$$

3. Non-linear data may sometimes be reduced to the linear form by a
simple transformation of one or both the variables.

4. The sum of squares of residuals may be found from the formula

$$U = \Sigma(Y^2) - a_0\Sigma(Y) - a_1\Sigma(YX) - \ldots - a_p\Sigma(YX^p)$$

5. One measure of the goodness of fit of the parabola to the data is
given by R, the correlation between actual and "predicted" values of the
variate. R is given by

$$R^2 = 1 - \frac{U}{n\sigma_Y^2}$$

where Y is the dependent variable.

* For a useful review of the problem see D. V. Lindley, Supp. J. Roy. Statist. Soc.,
1947, 9, 218.


EXERCISES 


15.1 Fit a straight line and parabolas of the second and third orders to
the following data, taking X to be the independent variable:

    X     Y

    0     1
    1     1.8
    2     1.3
    3     2.5
    4     6.3

and find the sum of squares of residuals in the three cases.

15.2 (Data quoted by P. L. Fegiz, "Le variazioni stagionali della
natalità," Metron, vol. 5, 1925, No. 4, p. 127.) The following figures
show the relation between duration of marriage and average number of
children per marriage in Norway in 1920:

    Duration of marriage     Average number of
          (years)                children

           0-1                     0.48
           5-6                     2.09
          10-11                    3.26
          15-16                    4.33
          20-21                    5.14
          25-26                    5.63
          30-31                    5.77

By the method of least squares find equations of the first, second and third
orders expressing the number of children in terms of the duration of
marriage. Compare the values given by these expressions for a duration
of 17-18 years with the true value 4.67.

15.3 The pressure of a gas and its volume are known to be related by an
equation of the form $pv^{\gamma} = \text{constant}$.

In a certain experiment the following volumes of a quantity of the
gas were observed for the pressures specified. Find the value of $\gamma$ by
fitting a straight line to the logarithms of p and v, taking p to be the
independent variable.

    p (kg. per square cm.)    0.5    1.0    1.5    2.0    2.5    3.0
    v (litres)                1.62   1.00   0.75   0.62   0.52   0.46


15.4 The following are the gross output and the gross output per £100
of labour employed, for a selected number of farms:

    Gross output     Gross output per £100 labour
      (units)                  (units)

          63                      40
         223                     155
         755                     188
         165                      78
       1,535                     315
       3,103                     200
       2,238                     250
       1,228                     231
       2,605                     255

Fit a quadratic parabola to these data, taking gross output as the
independent variable.



CHAPTER SIXTEEN

PRELIMINARY NOTIONS ON SAMPLING


The problem

16.1 In practical problems the statistician is often confronted with
the necessity of discussing a population of which he cannot examine every
member. For example, an inquirer into the heights of the population
of Great Britain cannot afford the time or expense required to measure
the height of each individual; nor can a farmer who wants to know what
proportion of his potato crop is diseased examine every single potato.

In such cases the best an investigator can do is to examine a limited 
number of individuals and hope that they will tell him, with reasonable 
trustworthiness, as much as he wants to know about the population from 
which they come. We are thus led naturally to the question : what 
can be said about a population when we can examine only a limited 
number of its members ? This question is the origin of the Theory of 
Sampling. 

16.2 A sample from a population is a selected number of individuals 
each of which is a member of the population. As a very special case the 
sample may consist of the entire population. 

It is a matter of common belief, founded on experience and intuition,
that a sample will tell us something about the parent population. The
corn merchant, whose livelihood depends on his ability to ascertain
the quality of the grain which he handles, is content to assess it by
thrusting a conical trowel into the middle of a sack and scrutinising the sample
he gets. He believes that the sample will be representative of the whole,
and experience justifies him. He buys and sells on the basis of judgment
from samples. It is also a matter of common belief that the larger a
sample becomes the more likely it is to reflect accurately the conditions
in the parent population.

To these and similar beliefs the theory of sampling gives a logical
basis and a system of quantitative measurement. In this chapter we
give a general survey of the fundamental ideas and the technique of
sampling. In later chapters we shall develop these ideas and discuss their
applications in various fields.

Types of population

16.3 Before we consider sampling itself, however, it is desirable to look
a little closer into the various types of population which we shall have
to investigate.

By a finite population we shall mean a population which contains a
finite number of members. Such, for instance, is the population of
inhabitants of Great Britain and the population of books in the British
Museum.

Similarly, by an infinite population we shall mean a population containing 
an infinite number of members. Such, for instance, is the population of 
pressures at various points in the atmosphere, or the population of 
possible sizes of the wheat crop, for, although there are limits to the 
size, the actual tonnage can take any numerical value within those limits. 

In many cases the number of members in a population is so large as to 
be practically infinite. Moreover, a theoretical discussion of an infinite 
population is frequently easier than a discussion of a finite population, and 
a large class of problems may be treated by assuming that the parent 
population is infinite, without introducing any sensible error. 

It may be worth remarking that in a few cases we may be ignorant 
whether or not the population under discussion is infinite. The population 
of stars is an example. 

Existent and hypothetical population 

16.4 By the logical extension of the idea of a population of concrete 
objects, which we shall call an existent population, we are able to construct 
the idea of a hypothetical population. 

Consider the throws of a die. Each throw will be regarded as an 
individual. There is an infinite number of throws which can be made 
with the die, provided that it does not wear out. Let us then define as 
our population of discussion all the possible throws of the die. 

In doing so we are clearly making some new step; for our population
is to be conceived as having no existence in reality but only in imagination.
We can give actuality to some members of the population by throwing the
die, but we can never produce them all. Even if the die were locked
away in a safe and never thrown at all there would still be a population
of possible throws.

Such a population is called a hypothetical population. We may define
it formally as the aggregate of all the conceivable ways in which a specified
event can happen. Other examples of hypothetical populations are the
population of all values which the bank rate can have in ten years' time,
and the population of the possible ways in which three balls can be
arranged on a billiard table.

16.5 A hypothetical population may, in fact, be imagined around
any observed event. We have only to picture all the circumstances
before the event happens; the population is then all the possible ways in
which it could happen. Which of the ways it will happen does not affect
the population. We know that "from the chaos of predestination and
the night of our forebeing" some one individual will emerge to assume
the mantle of reality; but which one that will be is another and more
difficult question.

16.6 The student of metaphysics would perhaps criticise the thoughts
expressed briefly in the previous two sections, but we have no space to
go further into the philosophical implications of the idea of hypothetical
populations. The problems which arise in this connection have, however,
far more than an abstract interest. They lie at the root of a great many
practical statistical problems, and most students, however utilitarian
their outlook, will find that a clear perception of the issues involved may
save a lot of thought and labour at a subsequent stage.

Population of populations 

16.7 Just as a population may contain a number of sub-populations, 
so any given population may be a member of some more widely defined 
population. For example, the population of inhabitants of Great Britain 
is a member of the population of populations, each of which consists of 
the inhabitants of some European country. 

Similarly, any existent population may be regarded as one member of a
hypothetical population of populations. For instance, the normal
population of men whose heights have a mean of 65 inches and standard
deviation 3 inches is a member of the hypothetical population of all
populations which are normally distributed with respect to height.

16.8 We shall sometimes have to discuss aggregates which it is difficult
to regard as composed of individual members at all: for example, we
may wish to sample a reservoir of water to test for pollution. In theory,
perhaps, we could in such a case regard the reservoir as a population
composed of molecules each of which was an individual, but in practice,
as we shall see, this is not usually a convenient method of approach.
Such populations may frequently be treated as composed of arbitrary units,
e.g. the reservoir may be regarded as composed of so many pints of fluid.
Similarly, a 280-lb. sack of flour may be regarded as composed of 4,480
ounces, and we can, if we like, regard it as weighed out into one-ounce
packets.

16.9 We can now turn to discuss the aims which usually underlie a
sampling inquiry.

Briefly, the fundamental object of sampling is to give the maximum
information about the parent population with the minimum effort. We
must, therefore, consider the type of information we require and the
methods by which it is to be obtained.

16.10 In sampling a population we usually have in mind one or more
variates. For instance, when we sample the population of Great
Britain we are not so much interested in the individuals as human beings
as in one of their qualities, such as height or weight, or perhaps the
correlation between height and weight. Our object will then be to get, from the
sample, an idea of the frequency-distribution in the parent population
according to the chosen variates.

The ideal for the purpose would be to express this distribution in some 
mathematical form such as a Pearson curve (8.48). It may be, however, 
that the parent population will not admit of this representation, or that the 
sample is not large enough for us to venture on it with any confidence. 

In such cases we attempt to find estimates of certain constants of the
parent population. Very often this is all we need. We can, for example,
form a very fair idea of the height distribution of the population of Great
Britain if we know the mean and the standard deviation. If we can go
further, and find the third and fourth moments, our idea will be better still.

Theory of estimation

16.11 Hence, a large part of the theory of sampling is devoted to finding 
from the sample estimates of certain constants of the parent population. 
Such constants include the measures of position and of dispersion together 
with the moments and measures of skewness ; and, in multivariate 
populations, the various total and partial correlations. 

In general, there are more ways than one of estimating a constant from
the data of the sample. Some of these ways will be better than others.
The Theory of Estimation treats of these and cognate matters. It seeks
to investigate the conditions which an estimate should obey, what are
the best estimates to employ in given circumstances, and how good other
estimates are in comparison.

Precision of estimates

16.12 It will be obvious that knowledge derived from a sample is not
of the categorical kind customary in mathematics. If we have 1,000 balls
in a bag and draw 999 of them which turn out to be black, it is always
possible that the remaining one is of some other colour. It is, however,
so improbable, that in most practical cases we should be justified in
concluding that the balls were all black.

If we did draw such a conclusion, and acted upon it, we should be basing
our action, not upon certainty, but on probability. One does this kind
of thing, of course, in nearly all everyday actions almost without noticing
it. Some events, such as the death of a man before reaching the age of
150, have such a high degree of probability that we never regard them as
other than certain; other events, such as the possibility of rain to-morrow,
are so uncertain that we should hesitate to make an important decision
contingent upon them.

16.13 The second aim of the theory of sampling is, therefore, to determine
as objectively as possible what degree of confidence we can put in our
estimates when they are obtained. This we do in terms of probability,
as far as we can; if this proves impossible, we sometimes have to rely
on intuitive impressions or the results of previous experience, which
are not expressible in quantitative terms.

Put in another way, we may say that our object is to determine the
precision of an estimate. We attempt to do this by assigning limits to
the probable divergence between the estimate based on the sample and the
true value of the estimated quantity in the population.

16.14 The accuracy of the estimate will depend on (a) the way in which
the estimate is made from the data of the sample, and (b) the way in
which the sample was obtained. Consideration of the first leads us
again to the theory of estimation. The second leads us to study the
technique of sampling and the design of statistical inquiries.

Tests of significance

16.15 If the sample is small we cannot, as a rule, assign to the estimates
we obtain sufficiently narrow limits to locate the population value with
any serviceable accuracy. For example, a correlation of +0.5 in a
sample of twelve might arise, rather infrequently, from a normal
population in which the true correlation was as high as +0.9 or as low as zero.
For such samples our questions are accordingly framed in more qualitative
terms: we do not ask, "What is the value of the correlation in the
population?" but, "Is the observed value significant of the existence of
any correlation at all in the population, whatever its value?" In other
words, we wish to know whether the observed value could have arisen
from a population in which the true correlation is zero. If our conclusion
is that it could not, we may say that the sample value is significant of
correlation, although we cannot say with much confidence what that
correlation is.

Much of the investigation arising out of small samples is thus of a rather
special character, and deals with tests of significance. The methods
developed for the purpose of conducting such tests can be, and not
infrequently are, applied also to large samples, either alone or supplementary
to the direct approach of forming more or less precise estimates of the
various quantities which specify the parent population.

Types of sampling

16.16 The process of forming a sample consists of choosing a
predetermined number of individuals from the parent population. The choice
may be exercised in three ways:

(a) By selecting the individuals at random (the meaning of "random"
is discussed below).

(b) By selecting the individuals according to some purposive principle.

(c) By a mixture of (a) and (b).

Thus, in taking a sample of the inhabitants of Great Britain to study
their income we might, according to method (a), select the individuals
at random from census returns; or according to (b) we might, knowing
roughly the average incomes in various age-groups, purposely select from
each group an individual whose income was somewhere near the average
in that group; or (c) we might decide to take ten individuals from each
group and select those ten by method (a).
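
Method (c) may be sketched in a few lines of Python. The age-groups, the identifying numbers and the helper name `stratified_sample` are hypothetical and serve only as an illustration; `random.sample`, from the standard library, draws without replacement.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw `per_stratum` members at random from each group."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


# Hypothetical population: identifying numbers of individuals by age-group.
strata = {
    "20-39": list(range(0, 500)),
    "40-59": list(range(500, 900)),
    "60 and over": list(range(900, 1000)),
}

sample = stratified_sample(strata, per_stratum=10, seed=1)
for name, chosen in sample.items():
    print(name, sorted(chosen))
```

The division into groups is the purposive element; within each group the choice is left to the random mechanism.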

16.17 Sampling of type (a) is called random sampling. That of type
(b) is called purposive sampling. That of type (c) is sometimes referred
to as mixed sampling. If the population is divided into "strata" by
purposive methods and then a portion of the sample is taken from each
"stratum," the sampling is said to be stratified.

The application of each of these types may be affected by what is known
as bias. This is the name given to perturbations which influence the
nature of the choice and make it something other than what the
experimenter intends it to be. Bias may be due to imperfect instruments, the
personal qualities of the observer, defective technique, or other causes.
Like experimental error, it is difficult to eliminate entirely, but usually
may be reduced to relatively small dimensions by taking proper care.

By an obvious extension of the nomenclature, we talk of a sample 
obtained by random sampling as a random sample, that obtained by 
purposive sampling as a purposive sample, and so on. 

Random sampling 

16.18 The reader no doubt already has some intuitive ideas about
randomness of choice. We may give a formal definition of random
sampling by saying that the selection of an individual from a population is
random when each member of the population has the same chance of being
chosen. Similarly, a sample of n individuals is random when it is chosen
in such a way that, when the choice is made, all possible samples of n have
an equal chance of being selected.
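
The second part of this definition can be examined empirically on a very small population. In the Python sketch below the five-member population and the number of trials are hypothetical; the pseudo-random generator of the standard library stands in for a truly random mechanism, which is itself an assumption.

```python
import random
from collections import Counter
from itertools import combinations

population = ["A", "B", "C", "D", "E"]     # hypothetical population
n = 2
possible = list(combinations(population, n))   # every possible sample of two

rng = random.Random(0)                     # fixed seed, so the run is repeatable
trials = 10_000
counts = Counter(
    tuple(sorted(rng.sample(population, n))) for _ in range(trials)
)

for s in possible:
    print(s, counts[s])                    # each should be near trials / 10
```

There are ten possible samples of two from five members, and in a long run each should turn up in about one-tenth of the trials.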

16.19 The first question arising out of this definition which we have
to consider is: How are we to obtain a random sample?

This question is more difficult than it appears at first sight. It might
be thought that any purely haphazard method of selection would give a
random sample. For example, if we wished to obtain a random sample of
local tradesmen, one way which suggests itself is to take a Trades Directory,
open it "at random" and take the first name on which the eye alights,
repeating the process until the sample is of the required size. Or again,
if we wished to obtain a random sample of wheat growing in a field, it might
be thought that a satisfactory method would be to throw a hoop in the air
"at random" and select all the plants over which it fell.

16.20 That such methods are apt to be deceptive may be seen from
the two examples we have just given. In the first, if we consulted a Trades
Directory which had already been used, we should probably find that it
opened at some pages more readily than at others; we should
tend to get the more popular tradesmen. Moreover, our eye might tend
to be caught by long names or peculiar names. In either case some
tradesmen would have a greater chance of being chosen than others, and the
sample would not be random.

Again, in the second example, our hoop might tend to be caught by the
taller ears of wheat, or we might tend unconsciously to throw it towards
parts of the field where the wheat looked to be about the average height.
These and other factors would destroy the random character of the
sampling.

[TABLE 16.1. Height measurements of wheat plants. Frequencies of plants
chosen by eye in ranks 1-8. (F. Yates, "Some Examples of Biased Sampling,"
Annals of Eugenics, 1935, 6, 202.)]

[Fig. 16.1. Distribution of wheat plants according to height (Table 16.1):
(a) Distribution of shoot heights (31st May) in ranks 1-8;
(b) Distribution of ear heights (26th June) in ranks 1-8.]

Human bias 

16.21 Experience has, in fact, shown that the human being is an
extremely poor instrument for the conduct of a random selection. Wherever
there is any scope for personal choice or judgment on the part of the
observer, bias is almost certain to creep in. Nor is this a quality which
can be removed by conscious effort or training. Nearly every human being
has, as part of his psychological make-up, a tendency away from true
randomness in his choices.

We may illustrate the unreliability of free choice on the part of even a
trained observer by taking an example of height measurements in samples
of wheat plants. In the course of certain work at the Rothamsted
Experimental Station, sets of eight wheat plants were selected for
measurement. Six of these shoots were chosen by purely random methods. The
other two were chosen "at random" by eye. If, in any set, the eight
shoots were ranged in order of magnitude, the two chosen by eye could
have any places from one to eight; and if they, in common with the other
six, were really random, they should have occupied these places with equal
frequency in a reasonably large number of sets. Table 16.1 shows the
resulting frequencies in the ranks one to eight for 116 sets taken on
31st May (before the ears of wheat had formed) and 112 sets taken on
26th June (after the ears had formed).

Fig. 16.1 shows the same results graphically, the dotted line giving
the frequencies to be expected if the choice was really random.

The divergence of the actual from the expected results is very striking,
and clearly cannot be attributed to fluctuations of sampling. It will be
seen that on 31st May, before the ears had formed, the observer was
strongly biased toward the taller shoots; whereas in June, after the
ears had formed, he was biased strongly towards a central position and
avoided short and tall plants.
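
The expected frequencies follow directly from the figures quoted in the text: two eye-chosen plants in each set, spread equally over eight ranks. The sketch below is an illustration only; the function names are hypothetical, and no observed frequencies are supplied because they are not reproduced here.

```python
def expected_rank_frequency(sets, chosen_by_eye_per_set=2, ranks=8):
    """Expected number of eye-chosen plants in each rank if the choice were random."""
    return sets * chosen_by_eye_per_set / ranks


def deviations(observed, expected):
    """Excess (+) or deficit (-) of each observed rank frequency over expectation."""
    return [o - expected for o in observed]


may_expected = expected_rank_frequency(116)   # 31st May: 116 sets
june_expected = expected_rank_frequency(112)  # 26th June: 112 sets
```

On these figures the expected frequency in each rank is 29 for the May sets and 28 for the June sets.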

16.22 Sight is not the only sense which may bias a sampling method.
In certain experiments counters of the same shape but of different colours
were put into a bag and chosen one at a time, the counter chosen being
put back and the bag thoroughly shaken before the next trial. On the
face of it this appears to be a purely random method of drawing the
counters. Nevertheless, there emerged a persistent bias against counters
of one particular colour. After careful investigation the only explanation
seemed to be that these particular counters were slightly more greasy
than the others, owing to peculiarities of the pigment, and hence slipped
through the sampler's fingers.

The student may perform similar experiments for himself. One of
the simplest is to ask a friend to recite "at random" one hundred digits,
including zero, and then count the number of odd ones. If the numbers
are really random, the number of even ones and odd ones should be about
equal, but there will frequently be found a bias one way or the other.
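
The digit experiment is easily scored. The following sketch is an illustration only: the recited digits are invented, and the yardstick used (the binomial standard deviation, half the square root of n, for independent digits equally likely to be odd or even) is an assumption brought in for the purpose.

```python
from math import sqrt


def count_odd(digits):
    """Number of odd digits in the recitation."""
    return sum(1 for d in digits if d % 2 == 1)


def departure_in_standard_deviations(digits):
    """Distance of the odd count from n/2, in units of sqrt(n)/2."""
    n = len(digits)
    return (count_odd(digits) - n / 2) / (sqrt(n) / 2)


# Hypothetical recitation of one hundred digits, not real data.
recited = [3, 7, 1, 9, 5, 2, 7, 3, 8, 1] * 10
departure = departure_in_standard_deviations(recited)
```

A departure of more than two or three such units would suggest a bias towards odd or even digits rather than a fluctuation of sampling.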

16.23 Enough has been said to show that if we are to evolve a satisfactory 
method of random sampling we must eliminate all personal choice. The 
method of selection must, therefore, follow some code of procedure which 
leaves nothing to the observer's idiosyncrasies.

It may sound a little paradoxical to obtain true randomness by following
rules of procedure. We are reminded of Bertrand's question: "How
can we talk of the laws of chance, which is the negation of all law?"
The ensuing sections will, it is hoped, remove any doubts on this head.

Technique of random sampling

16.24 The methods adopted in any given case to ensure as far as possible
that the sampling is random depend to some extent on the size and nature
of the population. Certain modes of procedure which are convenient
for small populations are not so for large populations. We shall also 
see that sampling from a hypothetical population has a special significance 
and special difficulties of its own. 

16.25 The criterion that every individual should have an equal chance 
of being chosen may be put in a somewhat different form. If the method 
of selection is independent of the properties of the sampled population 
which it is desired to investigate, there will, so far as those properties are 
concerned, be no reason why one individual should be chosen rather than 
another. Hence all values of the properties which occur in the population 
will have an equal chance of being chosen. If, therefore, we can produce 
a mode of procedure which bears no relation to the properties of the 
parent population which we are discussing, we may expect that it will give 
a random sample, so far as those properties are concerned. 

16.26 We may now consider a few examples of the kind of procedure
to which this rule leads.

Suppose we wish to take a sample of the inhabitants of a street. They
are already arranged in houses, and for the sake of simplicity we will take
our problem to be that of selecting a number of houses, whose occupants
will comprise our sample.

Let us take as our rule of procedure the selection of every tenth house,
starting at some arbitrary point. Unless there are peculiar circumstances,
it is presumable that the properties we are investigating, which may,
for instance, be income or size of family, are not grouped periodically
along the street. The method of selection is then independent of the
properties of the population and the sampling will be random.

If, however, the street were divided into blocks by cross-streets at
every tenth house, so that every house in our sample was a corner house,
and therefore, possibly, a shop, it is easy to see that the sample is no longer
random. Shops occur, in fact, along that street with period ten, and
since our method of selection has also that period, the method and the
qualities under investigation are no longer independent.
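
The danger of a period in the population coinciding with the period of selection can be shown in a few lines. The sketch below assumes a hypothetical street of sixty houses in which every tenth house is a corner shop; the street and the function name are invented for illustration.

```python
def every_kth(houses, k, start):
    """Select every k-th house, beginning at position `start` (0-based)."""
    return houses[start::k]


# Hypothetical street: houses numbered 1 to 60, every tenth one a corner shop.
street = ["shop" if number % 10 == 0 else "dwelling" for number in range(1, 61)]

sample_on_period = every_kth(street, 10, start=9)   # houses 10, 20, ..., 60
sample_off_period = every_kth(street, 10, start=3)  # houses 4, 14, ..., 54
```

The first sample consists entirely of shops and the second entirely of dwellings; neither represents the street, because the rule of selection is no longer independent of the property under investigation.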

16.27. We might then fall back on a different method. If we take 
a pack of plain cards, as similar as we can get them, we can make one card 
correspond to one of the houses by writing on it the number of the house 
in the street. The pack would then be a kind of miniature of the popula- 
tion for sampling purposes. We can draw a sample of houses by drawing 
a sample of cards, and if we shuffle the pack well we have every reason to 
hope that a random sample will result, for it is hard to imagine any way
in which the method of shuffling and drawing could be dependent on the 
properties of the population. It is not impossible to make it so, however. 
For instance, if the ink with which we wrote the numbers on the cards was 
slightly adhesive, the larger numbers would not be so easy to draw out 
as the small ones, and we should tend to get houses at one end of the 
street. If such houses were of the poorer class, our sample for the purpose 
of investigating income would not be random. 

Lottery sampling 

16.28 The method we have just described, of constructing a miniature 
population which is easily handled, is one of the most reliable methods 
of drawing a random sample. It is the method usually adopted in drawing 
the winning numbers in sweepstakes and lotteries. In such cases the 
population is the aggregate of persons owning tickets in the lottery. To 
every member of this population there corresponds a number, the totality 
of which numbers, written on pieces of paper, comprises the miniature 
population. In practice, these pieces are placed in similar containers, 
usually small metal cylinders, and thrown into a large rotating drum, in 
which they are thoroughly mixed or "randomised."
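
The miniature population of 16.27 and 16.28 may be imitated as follows. This is a sketch only, assuming a hypothetical street of one hundred houses and assuming that a pseudo-random shuffle from Python's standard library is an acceptable substitute for the shuffling of cards or the rotating drum.

```python
import random


def draw_by_lottery(house_numbers, sample_size, rng):
    """Shuffle a pack with one card per house and deal the top cards."""
    pack = list(house_numbers)  # one card for each member of the population
    rng.shuffle(pack)           # the shuffling, or the mixing in the drum
    return pack[:sample_size]   # the cards drawn


rng = random.Random(2024)  # seeded only so that the sketch is repeatable
houses = range(1, 101)     # hypothetical street of 100 houses
drawn = draw_by_lottery(houses, 10, rng)
```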

16.29 The practical difficulties of constructing the miniature population 
and of shuffling it are, however, severe if the parent population is at 
all large. The method is, of course, inapplicable on theoretical grounds 
if the population is not finite. To save the trouble of work with tickets it 
is often possible to use numerical methods. 

Suppose we require a set of points on the celestial sphere, as for example
if stars were uniformly distributed and we wanted a sample of stars. We
will take a point to be defined on the celestial sphere by latitude and longitude
(though this is not the way in which astronomers usually express it), and
ignore difficulties arising from the existence of double stars or unresolved
objects. What we want, then, is a set of random pairs of latitudes and
longitudes. As a crude method we might take an atlas of the world and
choose the figures set out in the index for places arranged alphabetically.
But it is easy to see that this method is unsound; for there will be more
names associated with the more populous districts, and hence the values
given in the index will tend to cluster round certain points and avoid
others: there will be none in the middle of seas or at the poles, so that
the pole star has no chance of being selected.

Let us then take a set of statistical tables and open it haphazardly.
We shall be confronted with a page of figures, and if we take, say, the tenth
figure in each row we shall probably get a set of digits which are random.
Suppose the first ten digits obtained in this way were 7, 0, 4, 7, 9, 6, 8,
2, 9, 1. We might then take our star to be defined by latitude 70° 47.9'
and longitude 68° 29.1'. Another page will give us another star, and
so on.
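
The conversion of ten digits into a position is purely mechanical, as the following sketch shows. The function name is hypothetical, and the sketch makes no attempt to reject groups of digits which do not form an admissible angle, a point on which the text is silent.

```python
def digits_to_position(digits):
    """Turn ten digits into (latitude, longitude): degrees, minutes and tenths."""
    if len(digits) != 10:
        raise ValueError("ten digits are needed")

    def angle(d):
        # Two digits of degrees, two of minutes, one of tenths of a minute.
        return f"{d[0]}{d[1]}° {d[2]}{d[3]}.{d[4]}'"

    return angle(digits[:5]), angle(digits[5:])


latitude, longitude = digits_to_position([7, 0, 4, 7, 9, 6, 8, 2, 9, 1])
```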

Random sampling numbers 

16.30 The difficulty in applying the method we have just described
lies in ensuring that the numbers we obtain are really random. Many 
tables of figures, such as logarithm tables, may fail to give random digits 
because there is a relation between the figures in successive rows. To 
obviate this difficulty certain Tables of Random Sampling Numbers have 
been constructed. 

One such set, due to L. H. C. Tippett, consists of 41,600 digits taken 
from census reports and combined by fours to give 10,400 four-figure 
numbers. We give here the first forty sets as an illustration of their 
general appearance — 

2952 6641 3992 9792 7979 5911 3170 5624 

4167 9524 1545 1396 7203 5356 1300 2693 

2370 7483 3408 2762 3563 1069 6913 7691 

0560 5246 1112 6107 6008 8126 4233 8776 

2754 9143 1405 9025 7002 6111 8816 6446 

The reader may wonder how it was ensured that these digits are random. 
They were chosen haphazard, but the real guarantee of their randomness 
lies in practical tests. We may say at once that Tippett’s numbers have 
been subjected to numerous investigations which make their randomness 
for many practical cases highly probable. A further set of numbers 
(100,000 in all) was constructed by Kendall and Babington Smith using 
a randomising machine. These also were carefully tested after
construction. The use of random sampling numbers will be apparent from
the following examples:

Example 16.1. To take a random sample of 10 from the population of
8585 men of Table 4.7, page 82.

Here we have 8585 individuals. We will number them from 1 to 8585.
The problem of selecting ten men at random is then that of finding ten
numbers at random between 1 and 8585. We therefore take a page of
random sampling numbers and select the first ten on the page which are not
greater than 8585. Thus, if our page were the one on which appear the
numbers we have quoted above, our individuals would be those
corresponding to the numbers, reading across,

2952, 6641, 3992, 7979, 5911, 3170, 5624, 4167, 1545, 1396

If we imagine the numbering to be done in order of height, starting with
the shortest and ending with the tallest, we see that the first individual falls
in the group 66-, the second in the group 69-, and so on. The height-ranges
in which the ten individuals fall are, in fact, in inches:

66-, 69-, 67-, 71-, 68-, 66-, 68-, 67-, 65-, 65-

Let us take their heights as being given by the centre points of these ranges,
and find their mean. We have

$$M = \tfrac{1}{10}(66 + 69 + \dots + 65) = 67.2$$

Hence the mean is 67.6 inches, as against the true value of 67.46 inches in
the whole population.
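
The selection in Example 16.1 can be traced mechanically. The sketch below copies the forty numbers quoted in 16.30 (0560 being written as 560) and keeps the first ten, reading across, which are not greater than 8585; the function name is hypothetical.

```python
TIPPETT_FIRST_FORTY = [
    2952, 6641, 3992, 9792, 7979, 5911, 3170, 5624,
    4167, 9524, 1545, 1396, 7203, 5356, 1300, 2693,
    2370, 7483, 3408, 2762, 3563, 1069, 6913, 7691,
    560, 5246, 1112, 6107, 6008, 8126, 4233, 8776,
    2754, 9143, 1405, 9025, 7002, 6111, 8816, 6446,
]


def first_admissible(numbers, population_size, sample_size):
    """Read across the page, keeping numbers from 1 to population_size."""
    chosen = []
    for number in numbers:
        if 1 <= number <= population_size:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


sample = first_admissible(TIPPETT_FIRST_FORTY, 8585, 10)
```

The numbers 9792 and 9524 are passed over as exceeding 8585, which gives the ten numbers listed in the example.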

Example 16.2. To take a sample of 5 from the distribution of screw
lengths of Table 4.3, page 72.

Here we have 206 individuals. It would clearly be a waste to use only
numbers from 0001 to 0206 for the screws and to neglect the rest, and we
are able to bring nearly all numbers into play by the following device.
We note that 206 goes 48 times into 10,000, with a certain remainder. In
fact, 206 x 48 = 9,888. We therefore attach 48 numbers to each screw.
Taking them in order, beginning at the shortest, we let the first screw
correspond to the numbers 0001 to 0048, the second to 0049 to 0096, the
third to 0097 to 0144, and so on, the 206th screw corresponding to the
numbers 9841 to 9888. Numbers above 9888 we leave out of account.
Referring to the table, we see that there is one screw in the first category
(5 to 6 thousandths short of an inch), four in the second (4 to 5 thousandths
short of an inch), and so on. The numbers corresponding to screws in the
different categories will then be 0001-0048, 0049-0240, 0241-0768, and
so on; or, in tabular form:

Difference in length    Numbers          Difference in length    Numbers
from 1 inch             corresponding    from 1 inch             corresponding
(thousandths)                            (thousandths)

-6 to -5                0001-0048        +1 to +2                5857-7488
-5 to -4                0049-0240        +2 to +3                7489-8688
-4 to -3                0241-0768        +3 to +4                8689-9456
-3 to -2                0769-1824        +4 to +5                9457-9840
-2 to -1                1825-3024        +5 to +6                9841-9888
-1 to  0                3025-4320
 0 to +1                4321-5856

We now take five random sampling numbers from the tables. For
instance, we might take the five in the first column of 16.30, i.e. 2952,
4167, 2370, 0560, 2754. The screws corresponding to these numbers will
be 1.5, 0.5, 1.5, 3.5 and 1.5 thousandths short of the inch respectively.

If we had obtained two numbers, say 0001 and 0002, in the first category,
we should have been faced with the necessity for a decision on how the
sampling was to be regarded, for there is only one screw in this category.
If we suppose that a sampled screw is abstracted from the population, it can
only be drawn once, and hence we should have had to ignore all numbers
in the category 0001 to 0048 subsequent to that which first occurred. If, on
the other hand, the screw is replaced, we can draw it again.
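
The allocation of numbers in Example 16.2 can be sketched as follows. The number of screws in each class is not quoted in full in the text; the counts below are inferred from the widths of the number ranges in the table (48 numbers to a screw), and the function name is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

NUMBERS_PER_SCREW = 48  # 206 x 48 = 9888

# Screws in each class, from "-6 to -5" up to "+5 to +6" thousandths,
# inferred from the widths of the number ranges in the table.
SCREWS_PER_CLASS = [1, 4, 11, 22, 25, 27, 32, 34, 25, 16, 8, 1]
UPPER_LIMITS = [NUMBERS_PER_SCREW * c for c in accumulate(SCREWS_PER_CLASS)]


def class_centre(number):
    """Centre of the class (thousandths from 1 inch) for a sampling number."""
    if not 1 <= number <= UPPER_LIMITS[-1]:
        return None  # numbers above 9888 are left out of account
    index = bisect_left(UPPER_LIMITS, number)
    return -6 + index + 0.5


centres = [class_centre(n) for n in (2952, 4167, 2370, 560, 2754)]
```

A negative centre denotes a screw short of the inch, so that the five values agree with the 1.5, 0.5, 1.5, 3.5 and 1.5 thousandths short given in the example.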


Example 16.3. In Example 2.5, page 25, we had the following data
giving the association between inoculation against cholera and exemption
from attack in 818 subjects:

                    Not attacked     Attacked        Total

Inoculated              276              3            279
                    (0001-3312)     (3313-3348)

Not inoculated          473             66            539
                    (3349-9024)     (9025-9816)

Total                   749             69            818

Let us take a sample of 10 from this population.

We observe that 818 goes into 10,000 twelve times, with a certain
remainder. In fact, 10,000 = 12 x 818 + 184. We can therefore attach
12 random sampling numbers to each member of the population. To the
276 inoculated-not-attacked individuals we attach the numbers 0001 to
3312 (12 x 276). To the 3 inoculated-attacked individuals we attach the
numbers 3313 to 3348 (a range of 36, equal to 3 x 12). Similarly for the
remaining individuals. The random sampling numbers corresponding to
the individuals in the four compartments of the table are shown in brackets
above.

We then take ten random sampling numbers from the tables, say the
first ten, reading across, from the numbers given in 16.30. If we had
come across a number greater than 9816 we should have ignored it. The
first number, 2952, gives us an individual falling in the inoculated-not-
attacked class; the second, 6641, gives us a member of the not-inoculated-
not-attacked class; and so on. The 10 numbers give the following
results:

                    Not attacked     Attacked        Total

Inoculated                2              0              2
Not inoculated            6              2              8

Total                     8              2             10
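
The classification in Example 16.3 can be reproduced step by step. The sketch below uses the class frequencies and the first ten numbers of 16.30 as quoted in the text; the names are hypothetical.

```python
from collections import Counter

NUMBERS_PER_PERSON = 12  # 10,000 = 12 x 818 + 184
CELLS = [  # (class, number of individuals), in the order used in the text
    ("inoculated, not attacked", 276),
    ("inoculated, attacked", 3),
    ("not inoculated, not attacked", 473),
    ("not inoculated, attacked", 66),
]


def cell_for(number):
    """Class to which a four-figure sampling number corresponds, or None."""
    upper = 0
    for name, count in CELLS:
        upper += NUMBERS_PER_PERSON * count
        if 1 <= number <= upper:
            return name
    return None  # numbers above 9816 are ignored


page = [2952, 6641, 3992, 9792, 7979, 5911, 3170, 5624, 4167, 9524]
tally = Counter(cell_for(n) for n in page)
```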


Example 16.4. Strictly speaking, random sampling numbers are
applicable only to sampling from a finite population, for we cannot attach a
different number to each member of an infinite aggregate. But, by the
following device, we can apply the tables to draw samples from a con- 
tinuous (and therefore infinite) population which is specified by a mathe- 
matical equation in such a way as to give us the proportion of the total 
frequency in given ranges of the variate. 

In fact, let us draw a sample from a normal population with unit 
standard deviation and unit total frequency. 

Let us take ranges of 0.1 on each side of the central ordinate. Table 2
of the Appendix will then give us the proportion of the frequency lying
in these ranges. As in Example 16.1, we divide up the numbers from
0000 to 9999 in proportion to these frequencies, and this is, in fact, a
particularly simple matter. All we have to do, for the positive values of the
variate, is to take the figures in the table, which have four figures. For
example, for the first interval 0.0 to 0.1, there will correspond the
numbers 5000 to 5398; to the interval 0.1 to 0.2, the numbers 5399
to 5793; to the interval 0.2 to 0.3, the numbers 5794 to 6179; and
so on. For the negative values of the variate we have, similarly, for 0.0
to -0.1, the numbers 4601 to 4999; for -0.1 to -0.2, the numbers 4206
to 4600; for -0.2 to -0.3, the numbers 3820 to 4205; and so on, there
being as many numbers in any negative range as in the corresponding
positive range. Occasionally doubt may arise in assigning a number to a
given interval owing to the difficulty of rounding up a figure ending in 5.
In practice it is not likely to make any difference which interval we
choose; if it threatens to do so, we can take the doubtful number to refer
alternately to the two possible intervals.
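
The assignment of numbers to ranges described above can be sketched as follows. It is assumed here that the normal integral computed by `statistics.NormalDist` in Python's standard library, rounded to four figures, may stand in for Table 2 of the Appendix; the function name is hypothetical.

```python
from statistics import NormalDist


def interval_for_number(number, width=0.1):
    """Range of the normal variate to which a number from 0000 to 9999 corresponds."""
    if not 0 <= number <= 9999:
        raise ValueError("a four-figure number is needed")
    negative = number < 5000
    # Negative ranges mirror the positive ones: 4999 pairs with 5000, 4601 with 5398.
    k = 9999 - number if negative else number
    normal = NormalDist()  # zero mean, unit standard deviation
    i = 0
    # Step outwards until the four-figure tabular value reaches the number.
    while round(10000 * normal.cdf(round((i + 1) * width, 10))) < k:
        i += 1
    lower, upper = round(i * width, 10), round((i + 1) * width, 10)
    return (-upper, -lower) if negative else (lower, upper)


example = interval_for_number(5500)
```

The mean of a sample would then be estimated by taking the centre of each interval returned.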

Having assigned numbers to the ranges, we select from the random
sampling numbers tables in the ordinary way. For instance, a number
5500 will correspond to a member in the range 0.1 to 0.2. If we wish
to ascertain the mean of a sample, or some similar function of the variate
values, we take the variate value of any individual to be the centre of the
interval in which it falls. This is an approximation, but the narrowness
of the intervals justifies it in most practical cases.

Sampling from infinite populations

16.31 The methods we have just been discussing are appropriate only
to those cases in which the population is finite, so that it was possible
to associate with each individual one or more random sampling numbers;
or to populations which, though infinite, can be treated by the method
of Example 16.4 owing to their complete specification according to the
variate under discussion. The required conditions are met with in much
of the material treated in practice, particularly in demographic and
economic work; but in other work the population may be either infinite
or so large as to be infinite for all practical purposes, and a different
technique must therefore be used.

Consider, for example, the problem of drawing a random sample from
a sack of flour. We clearly cannot number all the particles in the sack,
nor could we extract any given particles and examine them. We might,
perhaps, reduce this case to that of a finite population by weighing out the
flour into small, say one-ounce, packets and then sampling the packets.
This is a kind of mixed sampling. But it is also possible to handle the
problem by a special technique, as follows.

First of all, we mix the flour thoroughly. We then divide it into 
two halves and select one half. (It does not matter which, but for con- 
venience we may imagine two heaps, one on the right and one on the left, 
and select left and right alternately.) We then divide the half we have
chosen into two further halves, and again select one. The process is 
continued until the sample has reached a manageable size. We may 
reasonably suppose that it is random, especially if the flour is well mixed 
at each stage before being divided into two. 
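
The process of continued halving may be imitated on a list. The sketch below assumes a hypothetical sack represented by 1024 numbered portions, and assumes that a pseudo-random shuffle may stand in for the physical mixing; the function name is invented.

```python
import random


def reduce_by_halving(material, manageable_size, rng):
    """Mix, divide into two halves, keep one (left and right alternately), repeat."""
    heap = list(material)
    take_left = True
    while len(heap) > manageable_size:
        rng.shuffle(heap)  # mix thoroughly before each division
        middle = len(heap) // 2
        heap = heap[:middle] if take_left else heap[middle:]
        take_left = not take_left
    return heap


rng = random.Random(7)  # seeded only so that the sketch is repeatable
sack = range(1024)      # hypothetical sack of 1024 portions
sample = reduce_by_halving(sack, 16, rng)
```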

A similar technique may be used for many "continuous" substances,
such as milk, grain, cement, etc.

Sampling from hypothetical populations

16.32 The technique for drawing random samples brings out a funda- 
mental difference between existent and hypothetical populations. Taking 
a simple but typical case, let us draw a sample from the population of 
throws of a die. 

The methods we have previously used are quite obviously inapplicable
here. We cannot construct a card population, because we do not know
the nature of the parent population. Nor can we put all the possible
throws in a heap, and select from it by continued subdivision. In fact,
there is only one thing we can do, and that is to throw the die, and take
our results as a sample.

What reason have we to suppose that this is a random sample? The
answer lies partly in theory and partly in technique. In the first place,
we must adapt our method of throwing so that the sampling conditions,
so far as we can see, remain constant throughout the experiment. This
is a matter of technique, and our methods can, in fact, be tested. But
since our population does not exist for us to examine separately, the only
knowledge about it being derived from the sample itself, it will be seen
on a little reflection how difficult it is to say that every other possibility
in the population had an equal chance of occurring. We return to this
point in 16.35 and 16.36 below. Basically our assumption is that our
throws behave as if they were being chosen at random from an existent
population. The justification for this is our general knowledge of the
behaviour of dice.

The importance of random sampling 

16.33 We have already remarked on the importance of being able to 
gauge the error of an estimate made from a sample. The practical use 
of the theory of random sampling lies largely in the fact that it allows 
us to measure objectively, in terms of probability, errors of estimation or 
the significance of a result obtained from a random sample. The purposive 
methods to which we refer below do not do this, or at least have not yet 
been made to do so. The present trend among statisticians is, therefore, 
on the whole, in favour of the use of random sampling methods except in
certain special cases. 

16.34 At this point we may bring forward two important considerations.

In the first place, it must not be forgotten that random sampling may
produce the most unrandom-looking results. For instance, we usually
regard a hand of cards at bridge as a random sample from the population 
of 52 which comprise the pack ; but it is not unknown for a hand of 
13 spades to be dealt. The fact that the sample looks purposive, there- 
fore, proves nothing. But it does provide a basis for strong presumptions. 
How strong those presumptions may be the student may judge for himself 
by imagining what he would think of a card party at which he got 13 
spades twice in succession. 
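
The strength of the presumption can be put in figures. The sketch below assumes that every possible hand of 13 cards is equally likely and that successive deals are independent; it uses only Python's standard library.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 13)                 # number of possible hands of 13 cards
p_all_spades = Fraction(1, hands)    # exactly one hand consists of the 13 spades
p_twice_running = p_all_spades ** 2  # two independent random deals
```

There are 635,013,559,600 possible hands, so that a hand of 13 spades has a chance of about 1 in 6.35 x 10^11, and two in succession a chance of about 1 in 4 x 10^23.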

Secondly, we can never be absolutely certain that a method of sampling 
is random. There are doubts on a priori grounds because for any given 
method there are always conceivable sources of bias, and we can never 
rule out entirely the possibility that some of these sources are present. 
The utmost we can do is to make their presence extremely unlikely by
taking great care with the experiment. 

16.35 We can, however, apply tests to judge the randomness of a
sampling method. If we draw a single sample from a known population,
the result will tell us nothing about the method adopted; but if we take
a large number of samples they should, if the sampling is random, be
distributed in a certain way, and for some populations we can calculate
mathematically what that way ought to be. If, therefore, we apply our
sampling method to such a parent population and find the results widely
divergent from expectation, we have every reason to suspect our sampling
technique. Per contra, if the results and expectation are in accord, there
is good ground for reliance on the sampling.

16.36 Tests of this kind presuppose that we know the form of the parent
population. In sampling from a hypothetical population we do not
know this, and are forced to estimate it from the sample. Clearly, we
cannot use this estimate to criticise the method by which the sample was
obtained without some closer inquiry.
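
A test of the kind described in 16.35, for a parent population whose form is known, might be sketched as follows. The population here is the ten digits, each of which should occur equally often; the measure of divergence (the sum of squared deviations from expectation, each divided by the expectation) is one possible choice assumed for illustration, and the "biased" method is invented.

```python
import random
from collections import Counter


def divergence_from_uniform(draws, categories):
    """Sum of (observed - expected)^2 / expected, with equal expected counts."""
    counts = Counter(draws)
    expected = len(draws) / len(categories)
    return sum((counts[c] - expected) ** 2 / expected for c in categories)


digits = range(10)
rng = random.Random(0)  # seeded only so that the sketch is repeatable

fair_method = [rng.choice(digits) for _ in range(1000)]
# A deliberately biased method which never selects the digits 0 and 1.
biased_method = [rng.choice(range(2, 10)) for _ in range(1000)]

fair_value = divergence_from_uniform(fair_method, digits)
biased_value = divergence_from_uniform(biased_method, digits)
```

The two digits never drawn by the biased method contribute 100 apiece to its divergence, a value far beyond anything to be expected from fluctuations of sampling.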

Similar problems may arise for existent populations when we do not 
know the nature of the parent population but have to estimate some or all 
of its characteristics from the data of the sample. In such cases it is 
extremely difficult to be completely satisfied that the sampling is random. 
Frequently the best we can do is to use a method which has been found
satisfactory for other populations and hope, in the absence of any
indication to the contrary, that it will also be satisfactory for the present
population.

Purposive sampling 

16.37 We have already pointed out the dangers of introducing bias 
if the observer gives rein to his inclinations in choosing a sample, and 
have stressed the fact that in general there does not exist a method of 
assessing the degree of accuracy of an estimate made from a purposive 
sample. In spite of these handicaps, however, there are cases where 
purposive selection is a useful method. In this book we shall not con- 
sider it in any great detail, because the reliance placed upon it depends 
largely on the circumstances of the case, remains to a great extent a 
matter of personal opinion, and is not capable of being discussed by 
elementary methods. Nevertheless, our brief survey would be incomplete 
without some reference to it. 

16.38 Let us first of all consider the case of an observer who wishes 
to take a sample of two or three turnips from a cart-load. A random 
sample might give us several very large or very small turnips, though it 
is unlikely to do so. But if we allow the observer to run his eye over the 
whole load and then choose, he is most likely to take what he regards as 
average turnips — i.e. average in size, weight, shape, and whatever other 
quality may be in his mind. 

It may be claimed, with some plausibility, that this purposive method 
is more likely to give us a sample which is typical or representative of the 
population than a random method. The random sample may vary widely 
from the average, whereas the purposive sample does not. This gives 
the latter an advantage as a rule ; but it may be pointed out — 

(a) That as the sample becomes larger the random sample becomes
more and more representative of the parent, whereas, owing to bias, the
purposive sample in general does not.

(b) That in many cases the object of the sample is to give us information
about the whole of the population; the purposive sample might tell us
more about the mean weight of the turnips, but would probably give a
worse idea of the variance of the weights because the observer has
deliberately chosen values near the mean.

16.39 If we had to choose between pure random sampling and purposive 
sampling, our choice would probably be determined by balancing the 
uncertainties of the former, which are mainly due to fluctuations of 
chance, and the uncertainties of the latter, which are mainly due to bias. 

In practice, however, it is often possible to combine the two methods 
in stratified sampling and gain some of the advantages of each while 
minimising their disadvantages. 

The essentials of this process lie in dividing the parent population into 
strata and taking a random sample from each stratum. For instance, if 
we are taking a sample of earned incomes, we might first group individuals
into classes "earning up to £500 per annum," "earning from £500 to
£1,000 per annum," and so on, and then choose a random sample from each
class. Or, if we wanted a sample of farms in Great Britain, we might first
classify them roughly as "devoted mainly to arable crops," "devoted
mainly to milk production," "devoted mainly to vegetable growing," etc.,
and again take a random sample from each group.
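
Stratified sampling of this kind may be sketched as follows. The farms are hypothetical, and the rule of taking the same fraction from every stratum is an assumption made for the illustration; the text does not say how large a sample should be taken from each group.

```python
import random


def stratified_sample(strata, fraction, rng):
    """Take a random sample of the given fraction from every stratum."""
    sample = {}
    for name, members in strata.items():
        size = max(1, round(fraction * len(members)))  # at least one per stratum
        sample[name] = rng.sample(members, size)
    return sample


rng = random.Random(11)  # seeded only so that the sketch is repeatable
# Hypothetical farms, identified by serial number and grouped by main activity.
farms = {
    "arable": list(range(0, 60)),
    "milk": list(range(60, 90)),
    "vegetables": list(range(90, 100)),
}
chosen = stratified_sample(farms, 0.1, rng)
```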

16.40 Finally, we may also sample a population by first of all arranging 
its individuals in groups. This amounts to taking a different sampling 
unit. For instance, in sampling the population of Great Britain we might, 
as a matter of convenience, take streets or local government districts 
instead of individual human beings as our unit. We have already had an
instance of this type when we suggested as one way of sampling a sack of 
flour that it might be weighed out first into one-ounce packets. The 
process is obviously more convenient when this grouping has been done 
for us, e.g., in census returns. 

16.41 Each branch of science and industry presents its own sampling 
problems, and it would be difficult to expand the foregoing discussion so as 
to include the detailed requirements of the worker in every sphere. We 
shall revert to the general subject of sampling in Chapter 23, and conclude 
this chapter with an example of the way in which all the methods we 
have described may be pressed into service in order to give a sample 
which is as representative as practical limitations will allow.

It is the practice in England for manufacturers of sugar from sugar beet
to pay the growers according to the sugar content of their product. The
beet, which is not unlike a parsnip, is delivered to the factory in lots of at
least several tons with a certain amount of waste material, such as earth,
adhering to it. The problem is, then, (a) to find the net weight of the beet
when cleaned and ready for the slicing process, which is the first stage in
the extraction of the sugar, and (b) to ascertain the sugar content. The
method of procedure is as follows:

The gross weight of the load of beet is usually first obtained by weighing
the lorry which contains it when full, and when empty. From the middle
of the load of beet is then abstracted about 28 pounds, which is carefully
weighed, and then cleaned and weighed again. The difference in the
weights gives the "tare," that is to say, the proportion of waste matter,
and a proportional amount is deducted from the whole load to give the
net weight of beet. This process is equivalent to taking a random sample
and assuming that the value of the "tare" in the sample is the value in
the whole population.
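
The deduction for tare is a simple proportion. In the sketch below the weights are invented for illustration, and the function name is hypothetical.

```python
def net_weight(gross_load, sample_dirty, sample_clean):
    """Net weight of beet after deducting tare in the proportion found in the sample."""
    tare_proportion = (sample_dirty - sample_clean) / sample_dirty
    return gross_load * (1 - tare_proportion)


# Hypothetical figures, in pounds: a load of 10,000 lb and a sample of 28 lb
# which weighs 25.2 lb after cleaning, i.e. a tare of 10 per cent.
net = net_weight(10_000, 28.0, 25.2)
```

On these figures 1,000 lb would be deducted, leaving a net weight of 9,000 lb of beet.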

The sample of washed beet is then laid out on a table and arranged with 
the roots in order of size. From this sample a smaller sample is taken by 
choosing a beet every so often. This is a process of pure purposive selection.

The reduced sample is still inconveniently large, so it is reduced by
taking a slice from each beet. It is known that the sugar in the root is not
distributed homogeneously (although it is roughly symmetrical about the
axis of the root), so trained men are employed to slice one section with a
rasp, the section being that which would be obtained by cutting the root
from the thick end to the tapered end into two symmetrical halves and then
repeating the process one or more times. This selection again is
purposive in so far as the shape of the section is based on knowledge of the
distribution of the sugar, but random in so far as it is a matter of chance
what is the longitude of the particular slice chosen.

When each beet has been treated in this way there is given a heap of 
pulp which may be analysed. The heap is, however, as a rule still too 
large. It is therefore well mixed and divided into four heaps. Two heaps 
are thrown away, one is reduced to 26 grammes and analysed by the factory 
and one, similarly reduced, is analysed by the grower’s representative. 
This last method of selection is a random method adapted for a population 
which cannot readily be enumerated. 

The final sample therefore appears as the result of four successive 
sampling methods, two of which are random, one purposive, and one a 
mixture of purposive and random. 

SUMMARY 

1. Sampling may be random, purposive or mixed. 

2. Random sampling owes its importance to the fact that we can assess 
the results obtained from it in terms of probability. 

3. The presence of an element of choice on the part of the observer 
introduces the danger of bias, and should not be permitted where it can be 
avoided. 

4. Random samples may conveniently be drawn by the use of card 
populations or of random sampling numbers. 

5. The sampling technique adopted in any given case will depend largely 
on the circumstances of that case and the resources of the observer. At 
the present time the reliability of estimates made from samples is partly a 
matter of individual opinion founded on intuitive ideas, unless the sampling 
methods are random. 

EXERCISES 

16.1 Draw a random sample of 20 from the population of men of the last 
column of Exercise 4.6 (inhabitants of the United Kingdom classified 
according to weight). Find the mean of the sample and compare it with 
the mean of the population. 

16.2 Deal yourself a hand of 13 cards from an ordinary pack of 52 playing 
cards and count the number of court cards. Use your result to estimate 
the number of court cards in the whole pack. 

Repeat the experiment ten times, taking a new deal each time, and com- 
pare the mean of your results with the true value, 12. 

16.3 Suggest a method for obtaining a random sample of words from the 
English language by the use of random sampling numbers and a dictionary. 

16.4 Draw a sample of 30 from the population of the last column of 
Table 4.7, and find the standard deviation. Compare your result with the 
standard deviation of the population. 

16.5 Suggest a possible source of bias in the following — 

(a) A barrel of apples is sampled by taking a handful from the 
top. 

(b) A mixture of sand and sawdust is sampled by scooping up 
a quantity from the bottom. 

(c) A set of digits is taken by opening a Telephone Directory at 
random and choosing the telephone numbers in the order in 
which they appear on the page. 

(d) Readers of a newspaper are sampled by printing in it an 
invitation to them to send up their observations on some 
topical event. 

(e) Investigators into the size of families in a town conduct a 
house-to-house inquiry (1) in the morning, (2) in the after- 
noon, ignoring those houses at which there is no reply. 

16.6 Draw 100 samples of 10 from a normal population by means of 
random sampling numbers, and form the frequency-distribution of their 
means. 

16.7 In the data obtained in Exercise 16.6, form the frequency-distribu- 
tion of the root-mean-square deviations of the samples about the mean 
of the parent population. 

16.8 Draw 100 samples of 10 from the Poisson population of 8.47, page 194, 
and form the frequency-distribution of their means. 

16.9 Draw 500 samples of 4 from the population of Australian marriages 
of Table 4.8, page 84, and form the frequency-distribution of their range. 

16.10 Draw a sample of 50 from the population of Table 9.4, page 204 
(4912 dairy cows), and find the correlation in the sample between age in 
years and yield of milk per week. Compare your result with the correlation 
in the population. 

The problem 

17.1 In dealing with the theory of sampling we shall find it convenient 
to preserve the formal distinction between attributes and variables 
which we drew earlier in this book. The theory of the sampling of 
attributes is in many respects simpler than that of variables, and in this 
chapter we shall confine ourselves to it. We shall begin by considering 
a type of sampling which we shall call simple, involving certain limitations 
on the generality of the problem, and shall then proceed to examine the 
removal of these limitations in order to deal with the general case. 

17.2 The sampling of attributes may be regarded as the drawing of 
samples from a population containing A’s and not-A’s. The number of 
A's in each sample, or the proportion of A's, will form part of the data 
provided by the samples. 

We shall find it convenient to adopt the nomenclature of 8.3 and to 
speak of the drawing of an individual on sampling as an “event.” The 
appearance of the attribute A may be called a “success” and the non- 
appearance a “failure.” Thus, in sampling a human population for the 
proportions of the two sexes, we might say of a sample of 100, 45 of which 
were male, that the sample consisted of 100 events, 45 of which were 
successes and 55 failures. (It might, of course, be more convenient, 
and would certainly be more courteous, to reverse the names and call 
the occurrence of a female a “success” and of a male a “failure.”) 

Simple sampling 

17.3 By simple sampling we mean random sampling in which each 
event has the same chance p of success, and in which the chances of 
success of different events are independent, whether previous trials have 
been made or not. These conditions hold good, for instance, in the 
throwing of a die or the tossing of a coin ; the chance of getting heads 
with a coin is not affected by what was obtained on the previous trials, 
and remains constant no matter how many trials are made, provided, of 
course, that the coin does not begin to wear or is not falsely manipulated 
by the experimenter. 

Simple sampling is a particular form of random sampling, as we have 
defined it in the previous chapter. Suppose, for example, we take a 
sample of two from a population consisting of 6 men and 4 women under 
random sampling conditions, i.e. so that at each of the two events which 
constitute the sample every member of the population has an equal chance 
of being chosen. If, at the first trial, we draw a man, the chance of doing 
so being 6/10, there will be 5 men and 4 women left in the population, and 
the chance of obtaining a man on the second trial will be 5/9. This is not 
the same as the chance on the first trial, and hence the sampling is not 
simple, though it is random. 
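
The distinction may be checked by direct computation. The sketch below is an illustration of ours (the helper `chance_of_man` is a hypothetical name); it uses exact fractions to compare the chance at the second trial when the first man drawn is, and is not, returned to the population.

```python
from fractions import Fraction


def chance_of_man(men, women):
    """Chance that a single random draw from the group is a man."""
    return Fraction(men, men + women)


first = chance_of_man(6, 4)            # 6/10 at the first trial
# Without replacement: a man has gone, leaving 5 men and 4 women.
second_without = chance_of_man(5, 4)   # 5/9
# With replacement: the population is restored before the second trial.
second_with = chance_of_man(6, 4)

print(first, second_without, second_with)
```

Only in the second arrangement does the chance remain constant from trial to trial, which is the condition required for simple sampling.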

Mean and standard deviation in simple sampling of attributes 

17.4 Suppose now that we take N samples with n events in each. The 
chance of success of each event is p and of its failure q = 1 − p. As in 
8.6, the frequencies of samples with 0, 1, 2, . . . successes are the terms 
in the series $N(q+p)^n$, i.e. 

$$N\left\{q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot 2}q^{n-2}p^2 + \dots + p^n\right\}$$

As in 8.9, this distribution has mean M given by 

$$M = np$$

and standard deviation (8.10) 

$$\sigma = \sqrt{npq} \qquad (17.1)$$
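
As a numerical check on these results, the terms of the series may be computed and their mean and standard deviation found directly. The following is a minimal sketch of ours; the function names and the values N = 4096, n = 12, p = 0.5 are assumptions made only for the purpose of illustration.

```python
from math import comb, sqrt


def binomial_frequencies(N, n, p):
    """Expected frequencies of 0, 1, ..., n successes among N samples of n events."""
    q = 1 - p
    return [N * comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]


def mean_and_sd(freqs):
    """Mean and standard deviation of the number of successes."""
    total = sum(freqs)
    mean = sum(k * f for k, f in enumerate(freqs)) / total
    var = sum((k - mean) ** 2 * f for k, f in enumerate(freqs)) / total
    return mean, sqrt(var)


N, n, p = 4096, 12, 0.5          # illustrative values
freqs = binomial_frequencies(N, n, p)
mean, sd = mean_and_sd(freqs)
print(mean, n * p)                    # mean against np
print(sd, sqrt(n * p * (1 - p)))      # standard deviation against sqrt(npq)
```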

17.5 In lieu of recording the number of successes in each sample we 
might have recorded the proportion of successes, that is, (1/n)th of the 
number in each sample. As this would amount to dividing all figures 
of the record by n, the mean proportion of successes must be p, and the 
standard deviation of the proportion of successes is given by 

$$s = \sqrt{\frac{pq}{n}} \qquad (17.2)$$

Equations (17.1) and (17.2) are of fundamental importance. 

Example 17.1. The following results, due to Weldon, are of interest. 
Weldon threw 12 dice 4,096 times, a throw of 4, 5 or 6 being called a 
success. We have, then, 4,096 samples of 12 from the population con- 
sisting of all possible throws of the dice. 

If the dice are all true, the chance of success is 1/2. Hence, the theoretical 
mean M = 6; theoretical value of the standard deviation 
$\sigma = \sqrt{0.5 \times 0.5 \times 12} = 1.732$. 

The following was the frequency-distribution observed: 

Successes   Frequency     Successes   Frequency 
    0           -             7          847 
    1           7             8          536 
    2          60             9          257 
    3         198            10           71 
    4         430            11           11 
    5         731            12           - 
    6         948           Total      4,096 

Mean M = 6.139, standard deviation $\sigma$ = 1.712. The proportion of 
successes is 6.139/12 = 0.512 instead of 0.5. 
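
The observed mean and standard deviation can be recomputed from the table. The sketch below is ours; the dictionary `weldon` simply transcribes the frequencies given above, the blank entries being taken as zero.

```python
from math import sqrt

# Observed frequencies of 0, 1, ..., 12 successes (blank entries taken as zero).
weldon = {0: 0, 1: 7, 2: 60, 3: 198, 4: 430, 5: 731, 6: 948,
          7: 847, 8: 536, 9: 257, 10: 71, 11: 11, 12: 0}

total = sum(weldon.values())
mean = sum(k * f for k, f in weldon.items()) / total
var = sum(k * k * f for k, f in weldon.items()) / total - mean ** 2
sd = sqrt(var)

print(total, round(mean, 3), round(sd, 3), round(mean / 12, 3))
```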

Example 17.2. (G. U. Yule.) The following may be taken as an illustra- 
tion based on a smaller number of observations: Three dice were thrown 
648 times, and the numbers of 5’s or 6’s noted at each throw. p = 1/3, 
q = 2/3; theoretical mean 1; standard deviation 0.816. 

Frequency-distribution observed: 

Successes   Frequency 
    0          179 
    1          298 
    2          141 
    3           30 
  Total        648 

M = 1.034, $\sigma$ = 0.823. Actual proportion of successes 0.345. 

17.6 The value pn is sometimes called the " expected ” value of the 
number of successes in the sample. It is not only the mean value of 
all samples, but is the most probable value and is also representative, i.e. 
it bears the same ratio p to the number in the sample as the number of 
individuals with attribute A in the population bears to the total number 
in the population. The divergences of the number of successes from the 
expected value in any given random sample give rise to what we have 
hitherto called fluctuations of random sampling. They are to be regarded 
as deviations due to the nature of the sampling process, and not indicative 
of any real properties of the population itself. 

17.7 Equations (17.1) and (17.2) enable us to deal with the question 
which has arisen several times in earlier chapters of this book, namely, 
when can we say that observed deviations from the expected values in 
a sample of attributes are due to some real effect and are not merely 
attributable to sampling fluctuations ? 

The binomial distribution, to which samples classified according to 
the frequencies of an attribute give rise, is a single-humped type which 
approximates very closely to the normal for large values of n, the number 
in the sample. It follows that the great majority of its members lie 
within a range $\pm 3\sigma$ on each side of the mean, i.e. of $3\sqrt{npq}$ on each 
side of the value np. If the distribution is exactly normal, 0.9973 of the 
curve lies within this range (8.29). We can therefore say that if a 
particular sample gives a value of p outside this range, the deviation from 
the expected value is most unlikely to have arisen from fluctuations of 
simple sampling. If n is large, the chances are about 3 in a thousand 
that it arose in that way. 

It must be emphasised that the free use of the $3\sigma$ rule is justified only 
if n is large. 
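
The figure 0.9973 may be verified from the error function, and compared with the exact proportion of a binomial distribution lying within the same range. The following sketch is ours; the function names are hypothetical, and the choice n = 100, p = 0.5 for the binomial comparison is an assumption made for illustration only.

```python
from math import comb, erf, sqrt


def normal_coverage(k):
    """Proportion of a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))


def binomial_coverage(n, p, k=3.0):
    """Exact proportion of a binomial distribution within k sigma of np."""
    q = 1 - p
    mean, sd = n * p, sqrt(n * p * q)
    return sum(comb(n, r) * p ** r * q ** (n - r)
               for r in range(n + 1) if abs(r - mean) <= k * sd)


print(round(normal_coverage(3), 4))
print(binomial_coverage(100, 0.5))
```

For small n the binomial proportion may differ appreciably from the normal value, which is the reason for the caution stated above.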

Example 17.3. In the experiments of Example 17.1, 25,145 throws of 
a 4, 5 or 6 were made out of 49,152 throws altogether. The chance of 
throwing one of these numbers is 1/2, and hence the expected value is 24,576. 
The observed number was thus 569 in excess of this. Can the deviation 
from the expected value be due to fluctuations of simple sampling? 

The standard deviation of simple sampling is 

$$\sigma = \sqrt{npq} = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times 49152} = 110.9$$

The deviation observed is 5.13 times this quantity, and it is therefore 
most improbable that it arose as a sampling fluctuation. We must there- 
fore seek some other explanation of the deviation, and it seems reasonable 
to suspect that the dice were slightly biased. 

The problem might, of course, have been attacked equally well from 
the standpoint of proportion instead of the actual numbers of successes. 
This proportion is 0.5116 instead of the expected 0.5000, the difference 
in excess being 0.0116. The standard deviation of the proportion is 

$$s = \sqrt{\frac{pq}{n}} = \sqrt{\frac{\tfrac{1}{2} \times \tfrac{1}{2}}{49152}} = 0.00226$$

and the difference observed is 5.13 times this, which is the same ratio as 
before, as of course it must be. 
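
That the two modes of attack must give the same ratio is readily confirmed. The sketch below is ours and uses only the figures of this example; the variable names are hypothetical.

```python
from math import sqrt

n, successes, p = 49152, 25145, 0.5
q = 1 - p

# Working with the number of successes, formula (17.1).
expected = n * p
sd_count = sqrt(n * p * q)
ratio_count = (successes - expected) / sd_count

# Working with the proportion of successes, formula (17.2).
sd_prop = sqrt(p * q / n)
ratio_prop = (successes / n - p) / sd_prop

print(round(sd_count, 1), round(ratio_count, 2), round(ratio_prop, 2))
```

Dividing both the deviation and its standard deviation by n leaves their ratio unchanged, so the agreement is a matter of algebra and not of chance.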

Example 17.4. (Data from the Second Report of the Evolution Com- 
mittee of the Royal Society, 1905, p. 72.) 

Certain crosses of the pea, Pisum sativum, gave 5,321 yellow and 1,804 
green seeds. The expectation is 25 per cent of green seeds on a Mendelian 
hypothesis. Can the divergences from the expected values have arisen 
from fluctuations of simple sampling only? 

The numerical difference from the expected result is 23. The standard 
deviation of simple sampling is 

$$\sigma = \sqrt{0.25 \times 0.75 \times 7125} = 36.6$$

The divergence from theory is only about 0.6 of this, and hence may 
very well have arisen from fluctuations of simple sampling. 

Standard error 

17.8 We shall very frequently have to use the standard deviation of 
sampling, and it is convenient to have a shorter name for this quantity. 
We shall call it the standard error. The use of the word error is justified 
in this connection by the fact that we usually regard the expected value 
as the true value, and divergences from it as errors of estimation due to 
sampling effects ; but the student should not attach too much significance 
to the particular term “error.” 

In most of our work the term “standard error” will be applied to the 
standard deviation of simple sampling; but it has a rather wider meaning, 
embracing this one, which we shall discuss in considering the sampling of 
variables (18.22, cf. also 17.31). 

We may, then, summarise the foregoing in the statement that fre- 
quencies differing from the expected frequency by more than 3 times the 
standard error are almost certainly not due to fluctuations of sampling. 
They point to some departure of the sampling from simplicity, which may 
in turn point either to some flaw in the sampling technique or to causal 
effects in the population itself. 

Probable error 

17.9 Instead of the standard error, some authorities have used a quantity 
called the probable error, which is 0.67449 times the standard error. This 
practice arose from the fact that in the normal curve the quartiles are 
distant $0.67449\sigma$ from the mean, so that the probability that a deviation 
is in excess of the probable error is 1/2 and is equal to the probability of a 
deviation being less than the probable error. The rule that the observed 
deviation should not be greater than 3 times the standard error is then 
approximately equivalent to a rule that it should not exceed 4.5 times 
the probable error. 
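
The constant 0.67449 is the upper quartile of the normal curve with unit standard deviation, and may be recovered as follows. This is a sketch of ours using `NormalDist` from the Python standard library (available from Python 3.8); the variable names are hypothetical.

```python
from statistics import NormalDist

# Upper quartile of the standard normal curve: half of all deviations
# are numerically smaller than this multiple of the standard deviation.
factor = NormalDist().inv_cdf(0.75)

# Three standard errors expressed as a multiple of the probable error.
equivalent_multiple = 3 / factor

print(round(factor, 5), round(equivalent_multiple, 2))
```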

The use of the probable error is declining, and we recommend the student 
to eschew it. 

17.10 In Examples 17.1 to 17.4 we dealt with cases where p, the 
probability of success, was known a priori. In many cases it is not known, 
and further consideration is necessary before we can apply equations (17.1) 
and (17.2) to such cases. 

To fix the ideas, let us suppose that we have a simple sample of 1,000 
individuals from the inhabitants of Great Britain, and find that 36 per cent 
of them have blue eyes and the remainder have eyes of some other colour. 
What can we infer about the proportion of blue-eyed individuals in the 
whole population? 

In this instance we do not know the proportion p of blue-eyed in- 
dividuals in the population. We do know that the standard error is 
$\sqrt{1000pq}$. Now, whatever p and q are, pq cannot exceed 1/4, and hence the 
standard error cannot exceed $\tfrac{1}{2}\sqrt{1000}$, or 16. Hence, whatever p is, a 
simple sample should give a number of successes within 3 times this, or 48, 
of the expected frequency pn. This is 4.8 per cent of the sample, and we 
thus may say that the proportion of blue-eyed people in the whole popula- 
tion is 36 ± 4.8 per cent, i.e. that it lies between 31.2 and 40.8 per cent. 

17.11 We may, however, make a rather better estimate. We have 
seen that the standard error is small compared with the expected value, 
and hence with the observed value. If, therefore, in calculating the 
standard error we take the observed values of p and q in the sample instead 
of the unknown true values of p and q, we shall not involve ourselves in 
very great error. 

Thus, taking p to be 0.36, q = 0.64, 

$$\sigma = \sqrt{npq} = \sqrt{0.36 \times 0.64 \times 1000} = 15.18$$

Hence, $3\sigma = 45.5$ approximately, and the limits are now 36 ± 4.6 or 
31.4 and 40.6, slightly narrower than those previously obtained. 
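
The two sets of limits for the blue-eyed sample may be set side by side. The sketch below is ours, with hypothetical names; it should be noted that it carries the worst-case standard error unrounded (about 15.8), whereas the text rounds it up to 16, so that the wider limits come out very slightly narrower than 31.2 and 40.8.

```python
from math import sqrt

n = 1000
p_obs = 0.36

# Worst case: pq cannot exceed 1/4 whatever p may be.
sd_max = sqrt(n * 0.25)
# Observed p and q substituted for the unknown true values.
sd_plug = sqrt(n * p_obs * (1 - p_obs))


def limits_per_cent(sd):
    """Limits of 3 standard errors about the observed percentage."""
    half_width = 100 * 3 * sd / n
    centre = 100 * p_obs
    return centre - half_width, centre + half_width


print(limits_per_cent(sd_max))
print(limits_per_cent(sd_plug))
```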

17.12 In this example we have taken the proportion of successes in 
the sample to be an estimate of the proportion of successes in the popula- 
tion, and have set limits to the range within which the true proportion 
probably lies. There are other reasons, of an advanced theoretical 
character which we shall not specify, for taking p in the sample as an 
estimate of p in the population, but the student will probably concede 
that it is the most reasonable thing to do in the circumstances. We must, 
however, look a little more closely into the assumption that this estimate 
may be used in calculating the standard error. 

17.13 The assumption is a justifiable one if n is large and neither p nor 
q is small. For in such a case, the standard error of the proportion p is 
$\sqrt{pq/n}$, and this is small compared with p unless p itself is small. 

If, then, the standard error of p is small, the value of p estimated from 
the sample must be close to the real value, and we shall not introduce any 
serious error by taking the estimated value in evaluating the formula 
$\sqrt{pq/n}$. 

17.14 Precisely how large n must be for this approximation to be valid 
it is not easy to say. Samples of 1,000 are almost certainly large enough, 
and we may often apply the foregoing procedure with considerable 
confidence to much smaller samples, say of 100. For samples below that 
figure it is as well to examine carefully the circumstances of any given case 
and to proceed with caution. 

We shall have more to say on this matter when we consider the sampling 
of variables (18.17 and 18.18). 

For the remainder of this chapter we shall assume that our samples 
are “large,” that is to say, that the approximations involved in our 
assumptions as to the estimate of p are valid. 

Example 17.5. A sample of 900 days is taken from meteorological 
records of a certain district, and 100 of them are found to be foggy. What 
are the probable limits to the percentage of foggy days in the district? 

Anticipating somewhat our discussion of simple sampling, we will 
assume that the conditions of this problem give a simple sample. 

Hence, 

$$p = \tfrac{1}{9}, \quad q = \tfrac{8}{9}$$

Standard error of the proportion of foggy days 

$$= \sqrt{\frac{pq}{n}} = \sqrt{\frac{\tfrac{1}{9} \times \tfrac{8}{9}}{900}} = 0.0105 = 1.05 \text{ per cent.}$$

Hence, taking 1/9 to be the estimate of the proportion of foggy days, we have 
that the limits are 11.11 per cent ± 3.15 per cent, i.e. 8 per cent and 
14.25 per cent approximately. 

Example 17.6. A biased penny is tossed 100 times and comes down 
heads 70 times. What are the probable limits to the probability of getting 
a head in a single trial? 

We require to know the limits of p. If we assume that 100 is a large 
sample, we have 

$$\sqrt{\frac{pq}{n}} = \sqrt{\frac{0.7 \times 0.3}{100}} = 0.0458$$

The limits are therefore 0.70 ± (3 × 0.0458) 

= 0.70 ± 0.1374 

= 0.56 and 0.84 approximately 

If we feel any doubt as to the validity of using estimates of p and q 
from a sample of 100 in calculating the standard error, we may proceed 
as follows: 

The standard error of p cannot exceed $\sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{100}}$, i.e. 0.05. Hence 
the value of p lies almost certainly within the limits 0.70 ± 0.15, i.e. 0.55 
and 0.85. 

If p = 0.55, $\sqrt{pq/n}$ = 0.04975 

If p = 0.85, $\sqrt{pq/n}$ = 0.03571 

For intermediate values of p, $\sqrt{pq/n}$ lies between these limits. Hence the 
maximum value of the standard error is 0.04975, and p lies between the 
limits 0.70 ± 0.14925, i.e. 

0.55075 and 0.84925 

It will be seen that these limits are nearly equal to those obtained on 
the assumption that p = q = 1/2, and are not very different from those we 
got by assuming p = 0.70. There would, however, be an appreciable 
difference if p had been small, say 0.10. 
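
The successive steps of this example lend themselves to a short computation. The following sketch is ours; the helper names `se_proportion` and `max_se_on_interval` are hypothetical, and the second of them rests on the fact that pq is greatest at p = 1/2 and falls away on either side.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of a proportion, formula (17.2)."""
    return sqrt(p * (1 - p) / n)


def max_se_on_interval(low, high, n):
    """Largest standard error attainable for p between low and high."""
    if low <= 0.5 <= high:
        return se_proportion(0.5, n)
    return max(se_proportion(low, n), se_proportion(high, n))


n, p_obs = 100, 0.70

se_plug = se_proportion(p_obs, n)     # observed p substituted
se_worst = se_proportion(0.5, n)      # cannot be exceeded, whatever p
low, high = p_obs - 3 * se_worst, p_obs + 3 * se_worst
se_refined = max_se_on_interval(low, high, n)
refined_limits = (p_obs - 3 * se_refined, p_obs + 3 * se_refined)

print(round(se_plug, 4), round(se_refined, 5), refined_limits)
```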

17.15 If one of the two proportions p and q becomes very small, equation 
(17.1) may be put into an approximate form that is very useful. Suppose 
p to be the proportion that becomes very small, so that we may neglect 
$p^2$ compared with p; then 

$$pq = p - p^2 = p \text{ approximately}$$

and consequently we have approximately 

$$\sigma = \sqrt{np} = \sqrt{M} \qquad (17.3)$$

That is to say, if the proportion of successes be small, the standard 
deviation of the number of successes is the square root of the mean number 
of successes. Hence we can find the standard error even though p be 
unknown provided only we know that it is small. 

This is, in fact, the case when the binomial becomes the Poisson series 
(8.40). For such distributions the rule that a range of $6\sigma$ includes the 
great majority of the observations remains valid, as may be seen from 
the diagram on page 192, but the limits assigned to the standard error of 
the mean M may be too wide on the left of the mean. For example, if 
M = 1, $\sigma$ = 1, and a range of 3 units to the left of the mean carries us to a 
value of −2, whereas there can be no part of the frequency with negative 
values of the variate. 
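
How small p must be for formula (17.3) to serve may be judged by comparing it with the exact formula (17.1). The sketch below is ours; the function names and the trial values of n and p are assumptions chosen for illustration.

```python
from math import sqrt


def exact_sd(n, p):
    """Standard deviation of the number of successes, formula (17.1)."""
    return sqrt(n * p * (1 - p))


def approx_sd(n, p):
    """Approximation for small p, formula (17.3): the square root of the mean."""
    return sqrt(n * p)


n = 1000  # illustrative sample size
for p in (0.5, 0.1, 0.01, 0.001):
    relative_excess = approx_sd(n, p) / exact_sd(n, p) - 1
    print(p, round(exact_sd(n, p), 4), round(approx_sd(n, p), 4),
          round(relative_excess, 4))
```

The relative excess of the approximation depends on p alone, being $1/\sqrt{q} - 1$, and so is not improved by increasing n.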

17.16 It will be noticed that the standard error depends only on the 
value of p and the size of the sample, and that therefore the range within 
which p probably lies is independent of the size of the population. This 
appears a little paradoxical, because one might expect that a sample 
which was, say, 20 per cent of the population would enable closer limits 
to be set than one which was 10 per cent of the population. The ordinary 
man nearly always believes that a sample of only 1/1000 of the population 
necessarily gives much less trustworthy results than a sample of, say, 1/10, 
without regard to its actual size, but the belief is quite unjustified. 

The explanation is to be found in the nature of simple sampling itself. 
We shall see below (17.22) that the conditions under which simple sampling arises 
in practice are such that either the population is actually or practically 
infinite, or each member drawn for a sample is put back in the population 
before the next is drawn. In either case the population is inexhaustible, 
and no sample is any nearer to including all its members than another 
sample. It is, therefore, not surprising to find that the size of the popula- 
tion does not appear in the formula for the standard error. 


17.17 A further notable fact is that the standard error of p varies 
inversely as the square root of n, and not inversely as n itself. Thus, as 
n becomes larger the standard error becomes smaller, which is what we 
should expect, but the standard error decreases only in inverse proportion 
to the square root of n. For instance, if a sample of 100 gives us a standard 
error of 10 per cent, it will take a sample of 400 to halve that error, and 
a sample 100 times as large, i.e. 10,000, to reduce the error to one-tenth 
or one per cent. 
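
The rule may be put in the form that to divide the standard error by k we must multiply the size of the sample by k squared. The sketch below is ours; the function names are hypothetical, and the value p = 0.5 used in the final check is an assumption, the rule itself holding for any p.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of a proportion, formula (17.2)."""
    return sqrt(p * (1 - p) / n)


def sample_size_multiplier(reduction_factor):
    """Factor by which n must grow to divide the standard error by reduction_factor."""
    return reduction_factor ** 2


n = 100
n_to_halve = n * sample_size_multiplier(2)
n_for_tenth = n * sample_size_multiplier(10)

# Check against formula (17.2), with an assumed p of 0.5.
ratio = se_proportion(0.5, n) / se_proportion(0.5, n_to_halve)

print(n_to_halve, n_for_tenth, ratio)
```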

Precision 

17.18 The standard error may fairly be taken to measure the unreliability 
of an estimate of p; the greater the standard error, the greater the 
fluctuations of the observed proportion, although the true proportion 
is the same throughout. The reciprocal of the standard error (1/s), on 
the other hand — or some convenient multiple of the reciprocal — may be 
regarded as a measure of reliability, or, as it is sometimes termed, precision, 
and consequently the reliability or precision of an observed proportion 
varies as the square root of the number of observations on which it is based. 


The limitations of simple sampling 

17.19 In order to realise the limitations on the use of the formulae of 
equations (17.1) and (17.2), it is necessary to consider what are the con- 
ditions which will give rise to simple sampling in practice. Supposing, for 
example, that we observe among groups of 1,000 persons, at different times 
or in different localities, the various percentages of individuals possessing 
certain characteristics — dark hair, or blindness, or insanity, and so forth. 
Under what conditions should we expect the observed percentages to 
obey the law of sampling that we have found, and show a standard 
deviation given by equation (17.2) ? 

17.20 In the first place, the condition that p, the probability of drawing 
an individual with attribute A on random sampling, remains constant, 
and in particular is the same for all samples, means that the proportion 
of individuals with attribute A in the population must remain constant 
at the drawing of each sample. Consequently, if formula (17.2) is to 
hold good in our practical case of sampling there must not be a difference 
in any essential respect — i.e. in any character that can affect the proportion 
observed — between the localities from which the samples are drawn, nor, 
if the samples have been made at different epochs, must any essential 
change have taken place during the period over which the observations 
are spread. Where the causation of the character observed is more or 
less unknown, it may, of course, be difficult or impossible to say what 
differences or changes are to be regarded as essential, but where we have 
more knowledge the condition laid down enables us to exclude certain 
cases at once from the possible applications of formula (17.1) or (17.2). 
Thus it is obvious that the theory of simple sampling cannot apply to the 
variations of the death-rate in localities with populations of different 
age and sex composition, or to death-rates in a mixture of healthy and 
unhealthy districts, or to death-rates in successive years during a period 
of continuously improving sanitation. In all such cases variations due 
to definite causes are superposed on the fluctuations of sampling. 


17.21 Secondly, the proportion of individuals with attribute A must 
remain constant for the drawing of each individual member of the sample. 
This is again a very marked limitation. To revert to the case of death- 
rates, formulae (17.1) and (17.2) would not apply to the numbers of persons 
dying in a series of samples of 1,000 persons, even if these samples were all 
of the same age and sex composition, and living under the same sanitary 
conditions, unless, further, each sample only contained persons of one sex 
and one age. For if each sample included persons of both sexes and 
different ages, the condition would be broken, the chance of death during 
a given period not being the same for the two sexes, or for the young 
and the old. The groups would not be homogeneous in the sense required 
by the conditions from which our formulae have been deduced. 


17.22 We pointed out in 17.3 that sampling from a finite population 
is not simple owing to the fact that the abstraction of an individual alters 
the chance of success at the next trial. In practice there are three 
important cases in which the condition for the constancy of p is satisfied : 

(a) If the individuals are replaced at each drawing before the next 
drawing is made ; for in this case the constitution of the population is the 
same at each trial, and hence the chance of success must also be the same. 

(b) If the population is infinite ; for in this case the withdrawal of a 
finite number of members does not affect the proportion of individuals in 
the population possessing the attribute in question. 

(c) If the population is very large, p may be taken to be constant with- 
out sensible error, provided that the sample is not also large. This is a 
very important case, and justifies the application of the theory of simple 
sampling to many practical data. 

Suppose, for instance, we are sampling the population of the United 
Kingdom for sex ratio, and decide to take a sample of 1,000. Suppose 
again, for the purposes of illustration, that the whole population consists 
of 23 million women and 22 million men. The chance of getting a man at 
the first trial will then be 22,000,000/45,000,000. If we succeed in getting a man, 
the chance of doing so at the second trial will be 21,999,999/44,999,999. Even if we 
draw 999 men the chance of success at the thousandth trial would be 
21,999,001/44,999,001. All these chances, to a close approximation, are equal, and we 
can assume them to be so without fear of appreciable error. The case 
would, of course, have stood differently if our sample had numbered several 
millions. 
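
The closeness of these chances is easily exhibited. The sketch below is ours and uses only the illustrative population given above; the variable names are hypothetical.

```python
from fractions import Fraction

men, total = 22_000_000, 45_000_000

first = Fraction(men, total)
second = Fraction(men - 1, total - 1)            # after one man is drawn
thousandth = Fraction(men - 999, total - 999)    # after 999 men are drawn

# The greatest change in the chance over the whole sample of 1,000.
greatest_change = float(first - thousandth)

print(float(first), float(second), float(thousandth), greatest_change)
```

Even on the extreme supposition that the first 999 individuals are all men, the chance moves by little more than one part in a hundred thousand.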


17.23 A third condition for simple sampling was explicitly stated in 
our definition in 17.3. The individual events must be completely in- 
dependent of one another, like the throws of a die, or sensibly so, like the 
drawing of balls from a bag containing a number of balls which is large 
compared with the number drawn. Reverting to the illustration of a 
death-rate, our formulae would not apply even if the sample populations 
were composed of persons of one age and one sex, if we were dealing, for 
example, with deaths from an infectious or contagious disease. For if one 
person in a certain sample has contracted the disease in question, he has 
increased the possibility of others doing so, and hence of dying from the 
disease. The same thing holds good for certain classes of deaths from 
accident, e.g. railway accidents due to derailment, and explosions in mines: 
if such an accident is fatal to one person it is probably fatal to others also, 
and consequently the annual returns show large and more or less erratic 
variations. 


17.24 It is evident that these conditions very much limit the field of 
practical cases of an economic or sociological character to which formulae 
(17.1) and (17.2) can apply without considerable modification. The 
formulae appear, however, to hold to a high degree of approximation in 
certain biological cases, notably in the proportions of offspring of different 
types obtained on crossing hybrids, and, with some limitations, to the 
proportions of the two sexes at birth. It is possible, accordingly, that in 
these cases all the necessary conditions are fulfilled, but this is not a 
necessary inference from the mere applicability of the formulae. In the 
case of the sex ratio at birth it seems doubtful whether the rule applies to 
the frequency of the sexes in individual families of given numbers, but it 
does apply fairly closely to the sex ratios of births in different localities, 
and still more closely to the ratios in one locality during successive periods. 
That is to say, if we note the number of males in a series of groups of 
n births each, the standard deviation of that number is approximately 
$\sqrt{npq}$, where p is the chance of a male birth; or, otherwise, $\sqrt{pq/n}$ is the 
standard deviation of the proportion of male births. 

Applications of simple sampling 

17.25 We have already shown in examples how the theory of simple 
sampling can be used to gauge the precision of an estimate of the proportion 
of individuals in a population which possess an attribute A, and to set limits 
outside which that proportion probably does not lie. We now turn to 
further applications of the theory in the checking and control of the 
interpretation of statistical results. 

17.26 Case 1. Given the expected frequency in a sample and the 
observed frequency of successes, it is desired to know whether the deviation 
of the second from the first can have arisen from fluctuations of simple 
sampling. 

This is a case which we have discussed in Examples 17.3 and 17.4. 
From the expected frequency we can calculate the standard error, and if 
the deviation is more than 3 times this quantity it almost certainly did not 
arise from fluctuations of random sampling. 
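
The rule of Case 1 may be sketched in a few lines of Python. The function name and the figures below (1,000 trials, an expected chance of one-fourth, 290 successes observed) are hypothetical and serve only to show the arithmetic; they are not the data of Examples 17.3 or 17.4.

```python
import math

def deviation_in_standard_errors(observed, n, p_expected):
    """Return (observed - expected) / standard error under simple sampling."""
    expected = n * p_expected
    standard_error = math.sqrt(n * p_expected * (1.0 - p_expected))
    return (observed - expected) / standard_error

# Hypothetical figures: 1,000 trials, expected chance 1/4, 290 successes observed.
ratio = deviation_in_standard_errors(290, 1000, 0.25)
print(round(ratio, 2))
if abs(ratio) > 3:
    print("deviation exceeds three times the standard error")
else:
    print("deviation may have arisen from fluctuations of sampling")
```

The limit of three times the standard error is the working rule of the text, not a sharp boundary, and a ratio just inside it should be read with corresponding reserve.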

17.27 One caution is necessary here. If the deviation is less than
3 times the standard error, it does not follow that the expected frequency
divided by the number in the sample is really the proportion of individuals
possessing the attribute A in the population. In other words, if the
expected value is derived from some hypothesis, such as the Mendelian
hypothesis in the case of Example 17.4, the fact that the deviation lies
within the limits of 3 times the standard error does not prove the hypothesis
correct. It only indicates that experiment and hypothesis are not in 
disagreement. Furthermore, if the deviation lay without those limits, 
the hypothesis would not necessarily be disproved, for the fault might 
lie with the randomness of the sampling. 

17.28 Case 2. Two samples from distinct materials or different populations
give proportions of A's $p_1$ and $p_2$, the numbers of observations in
the samples being $n_1$ and $n_2$ respectively. (a) Can the difference between
the two proportions have arisen merely as a fluctuation of simple sampling,
the two populations being really similar as regards the proportion of A's
therein? (b) If the difference indicated were a real one, might it vanish,
owing to fluctuations of sampling, in other samples taken in precisely the
same way? This case corresponds to the testing of an association which is
indicated by a comparison of the proportion of A's amongst B's and β's.

(a) We have no theoretical expectation in this case as to the proportion 
of A’s in the population from which either sample has been taken. 

Let us find, however, whether the observed difference between $p_1$ and
$p_2$ may not have arisen solely as a fluctuation of simple sampling, the
proportion of A's being really the same in both cases, and given, let us say,
by the (weighted) mean proportion in our two samples together, i.e. by

$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

(the best guide that we have).

Let $\epsilon_1$, $\epsilon_2$ be the standard errors in the two samples, then

$$\epsilon_1^2 = p_0 q_0 / n_1, \qquad \epsilon_2^2 = p_0 q_0 / n_2$$

If the samples are simple samples in the sense of the previous work, then
the mean difference between $p_1$ and $p_2$ will be zero, and the standard error
of the difference $\epsilon_{12}$, the samples being independent, will be given by

$$\epsilon_{12}^2 = p_0 q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (17.4)$$

If the observed difference is less than some three times $\epsilon_{12}$ it may have
arisen as a fluctuation of simple sampling only.

(b) If, on the other hand, the proportions of A's are not the same in the
material from which the two samples are drawn, but $p_1$ and $p_2$ are the true
values of the proportions, the standard errors of sampling in the two cases
are

$$\epsilon_1^2 = p_1 q_1 / n_1, \qquad \epsilon_2^2 = p_2 q_2 / n_2$$

and consequently

$$\epsilon_{12}^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} \qquad (17.5)$$

If the difference between $p_1$ and $p_2$ does not exceed some three times
this value of $\epsilon_{12}$, it may be obliterated by an error of simple sampling on
taking fresh samples in the same way from the same material.

The student will note that in arriving at these results we have assumed 
that the unknown values $p_0$, $p_1$, $p_2$ are given to a sufficient degree of
approximation by estimates from the samples. This, as we have seen, is 
justified if n be large. 

Example 17.7. (Data from J. Gray, "Memoir on the Pigmentation
Survey of Scotland," Jour. of the Royal Anthropological Institute, 1907,
37.) The following are extracted from the tables relating to hair-colour
of girls at Edinburgh and Glasgow:

             Of medium      Total       Per cent
             hair-colour    observed    medium
Edinburgh        4,008        9,743       41.1
Glasgow         17,529       39,764       44.1

Can the difference observed in the percentage of girls of medium hair- 
colour have arisen solely through fluctuations of sampling ? 

In the two towns together the percentage of girls with medium hair-colour
is 43.5 per cent. If this were the true percentage, the standard
error of sampling for the difference between percentages observed in
samples of the above sizes would be

$$\epsilon_{12} = \left\{43.5 \times 56.5 \left(\frac{1}{9{,}743} + \frac{1}{39{,}764}\right)\right\}^{1/2} = 0.56 \text{ per cent.}$$

The actual difference is 3.0 per cent, or over 5 times this, and could not
have arisen through the chances of simple sampling.





If we assume that the difference is a real one and calculate the standard
error by equation (17.5), we arrive at the same value, viz. 0.56 per cent.
With such large samples the difference could not, accordingly, be
obliterated by the fluctuations of simple sampling alone.
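
The arithmetic of Example 17.7 may be checked with a short Python sketch. The counts are those of the example; the function and variable names are our own and purely illustrative.

```python
import math

def se_difference_pooled(a1, n1, a2, n2):
    """Equation (17.4): both samples assumed to share the pooled proportion p0."""
    p0 = (a1 + a2) / (n1 + n2)
    return math.sqrt(p0 * (1.0 - p0) * (1.0 / n1 + 1.0 / n2))

def se_difference_separate(a1, n1, a2, n2):
    """Equation (17.5): each sample keeps its own proportion."""
    p1, p2 = a1 / n1, a2 / n2
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)

# Figures of Example 17.7: girls of medium hair-colour and totals observed.
edinburgh = (4008, 9743)
glasgow = (17529, 39764)

difference = glasgow[0] / glasgow[1] - edinburgh[0] / edinburgh[1]
e_pooled = se_difference_pooled(*edinburgh, *glasgow)
e_separate = se_difference_separate(*edinburgh, *glasgow)

print(round(100 * e_pooled, 2))          # standard error by (17.4), per cent
print(round(100 * e_separate, 2))        # standard error by (17.5), per cent
print(round(difference / e_pooled, 1))   # difference in units of (17.4)
```

Working from the raw counts rather than from the rounded percentages gives a difference slightly under 3.0 per cent, but the conclusion is unaltered.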

17.29 Case 3. Two samples are drawn from distinct material or different
populations, as in the last case, giving proportions of A's $p_1$ and $p_2$, but
in lieu of comparing the proportion $p_1$ with $p_2$ it is compared with the
proportion of A's in the two samples together, viz. $p_0$, where, as before,

$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

Required to find whether the difference between $p_1$ and $p_0$ can have arisen
as a fluctuation of simple sampling, $p_0$ being the true proportion of A's
in both samples.

This case corresponds to the testing of an association which is indicated
by a comparison of the proportion of A's amongst the B's with the proportion
of A's in the population. The general treatment is similar to that
of Case 2, but the work is complicated owing to the fact that errors in
$p_1$ and $p_0$ are not independent.

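If $\epsilon_{01}$ be the standard error of the difference between $p_0$ and $p_1$, we
have at once

$$\epsilon_{01}^2 = \epsilon_0^2 + \epsilon_1^2 - 2 r_{01} \epsilon_0 \epsilon_1$$

$r_{01}$ being the correlation between errors of simple sampling in $p_0$ and $p_1$.
But from the above equation relating $p_0$ to $p_1$ and $p_2$, writing it in terms
of deviations in $p_0$, $p_1$ and $p_2$, multiplying by the deviation in $p_1$ and
summing, we have, since errors in $p_1$ and $p_2$ are uncorrelated,

$$r_{01} = \frac{n_1}{n_1 + n_2} \cdot \frac{\epsilon_1}{\epsilon_0} = \sqrt{\frac{n_1}{n_1 + n_2}}$$

Therefore finally, since $\epsilon_0^2 = p_0 q_0/(n_1 + n_2)$ and $\epsilon_1^2 = p_0 q_0/n_1$,

$$\epsilon_{01}^2 = \frac{p_0 q_0}{n_1 + n_2} \cdot \frac{n_2}{n_1} \qquad (17.6)$$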

Unless the difference between $p_0$ and $p_1$ exceed, say, some three times
this value of $\epsilon_{01}$ it may have arisen solely by the chances of simple
sampling.

It will be observed that if $n_1$ be very small compared with $n_2$, $\epsilon_{01}$
approaches, as it should, the standard error for a sample of $n_1$ observations.

We omit, in this case, the allied problem whether, if the difference
between $p_1$ and $p_0$ indicated by the samples were real, it might be wiped
out in other samples of the same size by fluctuations of simple sampling
alone. The solution is a little complex, as we no longer have
$\epsilon_0^2 = p_0 q_0/(n_1 + n_2)$.

Example 17.8. Taking now the figures of Example 17.7, suppose
that we had compared the proportion of girls of medium hair-colour in
Edinburgh with the proportion in Glasgow and Edinburgh together.
The former is 41.1 per cent, the latter 43.5 per cent, difference 2.4 per cent.
The standard error of the difference between the percentages observed in
the sub-sample of 9,743 observations and the entire sample of 49,507
observations is, therefore,

$$\epsilon_{01} = (43.5 \times 56.5)^{1/2} \left(\frac{39{,}764}{9{,}743 \times 49{,}507}\right)^{1/2} = 0.45 \text{ per cent.}$$

The actual difference is over five times this (the ratio must, of course, be
the same as in Example 17.7), and could not have occurred as a mere
error of sampling.
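
Equation (17.6) and the figures of Example 17.8 may be verified in the same manner. The sketch below uses the counts of Example 17.7; the function name is an illustrative choice of our own.

```python
import math

def se_part_versus_whole(p0, n1, n2):
    """Equation (17.6): standard error of p1 - p0, where p0 is the pooled proportion."""
    return math.sqrt(p0 * (1.0 - p0) * n2 / (n1 * (n1 + n2)))

# Figures of Examples 17.7 and 17.8.
n1, n2 = 9743, 39764          # Edinburgh, Glasgow
a1, a2 = 4008, 17529          # girls of medium hair-colour
p1 = a1 / n1
p0 = (a1 + a2) / (n1 + n2)

e01 = se_part_versus_whole(p0, n1, n2)
print(round(100 * e01, 2))            # standard error, per cent
print(round((p0 - p1) / e01, 1))      # difference in units of the standard error
```

The second figure printed agrees with the ratio found for Case 2, as the text remarks it must, since both the difference and its standard error are reduced in the proportion $n_2/(n_1+n_2)$.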

Effect of removing the limitations of simple sampling

17.30 Let us now consider the effect on the standard error of the removal
of the conditions of simple sampling which we discussed in 17.19 to 17.24.

The breakdown of the condition we discussed in 17.20, namely, that 
the proportion of A’s in the population should remain constant for all 
samples, might occur if we took a number of samples from a changing 
population or from different strata of a population which was not homo- 
geneous. 

We may represent such circumstances in a case of artificial chance by
supposing that for the first $f_1$ throws of $n$ dice the chance of success for
each die is $p_1$, for the next $f_2$ throws $p_2$, for the next $f_3$ throws $p_3$, and so
on, the chance of success varying from time to time, just as the chance
of death, even for individuals of the same age and sex, varies from district
to district. Suppose, now, that the records of all these throws are pooled
together. The mean number of successes per throw of the $n$ dice is given
by

$$M = \frac{n}{N}\,\Sigma(fp) = n p_0$$

where $N = \Sigma(f)$ is the whole number of throws, and $p_0$ is the mean value
$\Sigma(fp)/N$ of the varying chance $p$. To find the standard deviation of the
number of successes at each throw, consider that the first set of throws
contributes to the sum of the squares of deviations an amount

$$f_1\left[n p_1 q_1 + n^2 (p_1 - p_0)^2\right]$$

$n p_1 q_1$ being the square of the standard deviation for these throws, and
$n(p_1 - p_0)$ the difference between the mean number of successes for the
first set and the mean for all the sets together. Hence the standard
deviation $\sigma$ of the whole distribution is given by the sum of all quantities
like the above, or

$$N\sigma^2 = n\,\Sigma(fpq) + n^2\,\Sigma\{f(p - p_0)^2\}$$

Let $\sigma_p$ be the standard deviation of $p$, then the last sum is $N n^2 \sigma_p^2$, and
substituting $1 - p$ for $q$, we have

$$\sigma^2 = n p_0 - n p_0^2 - n\sigma_p^2 + n^2\sigma_p^2$$
$$\phantom{\sigma^2} = n p_0 q_0 + n(n-1)\sigma_p^2 \qquad (17.7)$$

This is the formula corresponding to equation (17.1); if we deal with
the standard deviation of the proportion of successes, instead of that of
the absolute number, we have, dividing through by $n^2$, the formula
corresponding to equation (17.2), viz.

$$s^2 = \frac{p_0 q_0}{n} + \frac{n-1}{n}\,\sigma_p^2 \qquad (17.8)$$
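
The reduction leading to (17.7) may be checked numerically. In the sketch below the pooled variance is first built up set by set, as in the argument above, and then computed from the final formula. The experiment described (12 dice, three sets of throws with chances 0.4, 0.5 and 0.6) is hypothetical, as are the function names.

```python
def pooled_variance_direct(n, throws, chances):
    """Variance of the number of successes when all throws are pooled.

    Each set contributes its own binomial variance n*p*q plus the squared
    distance n^2 * (p - p0)^2 of its mean from the general mean.
    """
    total = sum(throws)
    p0 = sum(f * p for f, p in zip(throws, chances)) / total
    return sum(f * (n * p * (1 - p) + n**2 * (p - p0) ** 2)
               for f, p in zip(throws, chances)) / total

def pooled_variance_formula(n, throws, chances):
    """Equation (17.7): n*p0*q0 + n*(n-1)*sigma_p^2."""
    total = sum(throws)
    p0 = sum(f * p for f, p in zip(throws, chances)) / total
    var_p = sum(f * (p - p0) ** 2 for f, p in zip(throws, chances)) / total
    return n * p0 * (1 - p0) + n * (n - 1) * var_p

# Hypothetical experiment: 12 dice, three sets of throws with differing chances.
n = 12
throws = [100, 200, 100]
chances = [0.4, 0.5, 0.6]

print(pooled_variance_direct(n, throws, chances))
print(pooled_variance_formula(n, throws, chances))
print(n * 0.5 * 0.5)   # simple-sampling value n*p0*q0 for comparison
```

The two routes agree, and both exceed the simple-sampling value $n p_0 q_0$ by the amount $n(n-1)\sigma_p^2$.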

17.31 If $n$ be large and $s_0$ be the standard error calculated from the
mean proportion of successes $p_0$, equation (17.8) is sensibly of the form

$$s^2 = s_0^2 + \sigma_p^2$$

We have thus analysed $s^2$ into two parts, $s_0^2$ the portion due to deviations
from the mean $p_0$, and $\sigma_p^2$ the portion due to variations of the $p$'s
about their mean. The former we may regard as the contribution to
$s^2$ due to chance fluctuations; the latter as the contribution due to real
variation of the proportions among the different strata of the population.

In conformity with later work we shall continue to call $s$ (or $\sigma$ if we
are dealing with frequencies) the standard error, although the sampling
is no longer simple. The deviation $s$ is still, in fact, the standard deviation
of the various sample values of $p$ about the mean value. The term
$s_0$ (or $\sqrt{n p_0 q_0}$), on the other hand, is what the standard error would have
been if the sampling had been simple, and from the above equation we
accordingly see that the effect of the breakdown of the first condition for
simple sampling is to increase the standard error.

We may illustrate the effect of variations in $p$ on the data of Table 17.1,
showing the percentages of the electorate voting in municipal elections
in England, in various groups according to size of electorate. (The
figures in the original returns for percentages are given to the first place
of decimals, so the intervals are centred at 22.45, 27.45, etc.)

At the foot of the table we show the actual variances $s^2$ and the
theoretical variances based on the formula $pq/n$. For instance, in the
size group 0 to 5,000 we have $p = 0.5621$ and take $n$ as the mid-point
of the range, namely 2,500. The variance (in terms of percentages, not
proportions) is then $(0.5621 \times 0.4379 \times 100^2)/2{,}500 = 0.98$.

Now it is clear from these data that the theoretical variances are only
a very small proportion of the actual variances. In short we cannot
assume, even in electorates of about the same size, that the numbers
voting are distributed in the binomial form. There is, so to speak, no
"proneness to vote" common to all electors and represented by the
proportion $p$. There are (as we know for elections) substantial variations
between electorates, represented by the variances $s^2 - s_0^2$.

The effect of these results on "straw votes" for the forecasting of
elections is evident. We cannot measure the standard error of proportions
in samples of persons indicating their voting intentions by the simple-
sampling formulae.

TABLE 17.1. Percentages of electorate voting in municipal elections in England in 1945

County boroughs and boroughs with more than 100,000 voters omitted. Electorate
includes only those persons entitled to vote on this occasion, i.e., persons in
non-contested areas are excluded.

Data from Registrar-General's Review of England and Wales for 1946, Tables Part II Civil.

Percentage of                                   Size of electorate
electorate voting                  0 to   5,001 to  10,001 to  15,001 to  20,001 to  50,001 to
                                  5,000     10,000     15,000     20,000     50,000    100,000

20-                                   -          1          2          -          1          1
25-                                   3          6          2          2          5          3
30-                                  10         17         12          9         16          5
35-                                  20         18         13         14         20          9
40-                                  40         44         31         10         31         10
45-                                  39         44         32          9         33          3
50-                                  82         39         26         14         25          1
55-                                  77         54         21          9         17          -
60-                                  72         31         12          6          6          -
65-                                  42         12          6          5          2          -
70-                                  32          5          3          -          -          -
75-                                  12          1          2          -          -          -
80-                                   3          -          -          -          -          -
85-                                   1          1          -          -          -          -
90-                                   -          -          -          1          -          -

Totals                              433        272        162         79        156         32
Means                             56.21      50.51      48.81      47.83      45.56      39.79
Variances $s^2$                  120.12     113.45     111.43     140.36      82.80      85.91
Theoretical variances $s_0^2$      0.98       0.33       0.20       0.14       0.07       0.03
$\sqrt{s^2 - s_0^2}$               10.9       10.6       10.5       11.9        9.1        9.3
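
The theoretical variances at the foot of Table 17.1 may be recomputed as follows. The means are those of the table; the text states the mid-point 2,500 only for the first size group, and the mid-points taken for the remaining groups are our own assumption, made by analogy. The function name is illustrative.

```python
# Means (per cent voting) from the foot of Table 17.1, with the mid-point
# of each size group taken as n (stated in the text for the first group,
# assumed by analogy for the others).
groups = [
    ("0 to 5,000",        56.21,  2_500),
    ("5,001 to 10,000",   50.51,  7_500),
    ("10,001 to 15,000",  48.81, 12_500),
    ("15,001 to 20,000",  47.83, 17_500),
    ("20,001 to 50,000",  45.56, 35_000),
    ("50,001 to 100,000", 39.79, 75_000),
]

def theoretical_variance(mean_per_cent, n):
    """Simple-sampling variance p*q/n, expressed in (per cent) squared."""
    p = mean_per_cent / 100.0
    return p * (1.0 - p) * 100.0**2 / n

for label, mean, n in groups:
    print(label, round(theoretical_variance(mean, n), 2))
```

On these assumptions the values agree, to two places of decimals, with the row of theoretical variances in the table.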


The figures of this case also bring out clearly one important consequence
of (17.8), viz. that if we make $n$ large, $s$ becomes sensibly equal to $\sigma_p$,
while if we make $n$ small, $s$ becomes more nearly equal to $\sqrt{p_0 q_0/n}$. Hence,
if we want to know the significant standard deviation of the proportion $p$
(the measure of its fluctuation owing to definite causes) $n$ should be
made as large as possible; if, on the other hand, we want to obtain good
illustrations of the theory of simple sampling, $n$ should be made small.
If $n$ be very large, the actual standard error may evidently become almost
indefinitely large compared with the standard deviation of simple sampling.
Thus during the twenty years 1855-74 the death-rate in England and Wales
fluctuated round a mean value of 22.2 per thousand with a standard
deviation ($s$) of 0.86. Taking the mean population as roughly 21 millions,
the standard deviation of simple sampling ($s_0$) is approximately

$$\sqrt{\frac{22.2 \times 977.8}{21 \times 10^6}} = 0.032 \text{ per thousand}$$

This is only about one twenty-seventh of the actual value.
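
The death-rate figures just quoted may be checked in a few lines. The sketch works throughout in rates per thousand, so that $pq$ becomes the rate multiplied by its complement to 1,000; the variable names are our own.

```python
import math

rate = 22.2                 # mean death-rate per thousand, 1855-74
observed_sd = 0.86          # observed standard deviation, per thousand
population = 21_000_000     # rough mean population

# Work in rates per thousand: p*q becomes rate * (1000 - rate).
s0 = math.sqrt(rate * (1000 - rate) / population)

print(round(s0, 3))               # standard deviation of simple sampling
print(round(observed_sd / s0))    # observed value as a multiple of it
```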

17.32 Now consider the effect of altering the second condition of simple 
sampling dealt with in 17.21, viz. the circumstances that regulate the 
appearance of the character observed shall be the same for every in- 
dividual or every sub-class in each of the populations from which samples 
are drawn. Suppose that in a group of $n$ dice thrown the chances for
$m_1$ dice are $p_1$, $q_1$; for $m_2$ dice, $p_2$, $q_2$; and so on, the chances varying for
different dice, but being constant throughout the experiment. The case
differs from the last, as in that the chances were the same for every die 
at any one throw, but varied from one throw to another ; now they are 
constant from throw to throw, but differ from one die to another as they 
would in any ordinary set of badly made dice. Required to find the effect 
of these differing chances. 

For the mean number of successes we evidently have

$$M = m_1 p_1 + m_2 p_2 + m_3 p_3 + \ldots = n p_0$$

$p_0$ being the mean chance $\Sigma(mp)/n$. To find the standard deviation of the
number of successes at each throw, it should be noted that this may be
regarded as made up of the number of successes in the $m_1$ dice for which the
chances are $p_1$, $q_1$, together with the number of successes amongst the $m_2$
dice for which the chances are $p_2$, $q_2$, and so on; and these numbers of
successes are all independent. Hence,

$$\sigma^2 = m_1 p_1 q_1 + m_2 p_2 q_2 + m_3 p_3 q_3 + \ldots = \Sigma(mpq)$$

Substituting $1 - p$ for $q$, as before, and using $\sigma_p$ to denote the standard
deviation of $p$,

$$\sigma^2 = n p_0 q_0 - n\sigma_p^2 \qquad (17.9)$$

or if $s$ be, as before, the standard error of the proportion of successes,

$$s^2 = \frac{p_0 q_0}{n} - \frac{\sigma_p^2}{n} \qquad (17.10)$$

Hence, in this case the standard error $s$ is less than the standard error
of simple sampling.

17.33 The extent to which the standard error is affected may conceivably
be considerable. To take a limiting case, if $p$ be zero for half the
events and unity for the remainder, $p_0 = q_0 = \tfrac{1}{2}$ and $\sigma_p = \tfrac{1}{2}$, so that $s$ is zero.

To take another illustration, still somewhat extreme, if the values of $p$
are uniformly distributed over the whole range between 0 and 1, $p_0 = q_0 = \tfrac{1}{2}$
as before, but $\sigma_p^2 = 1/12 = 0.0833$ (6.15, p. 136). Hence, $s^2 = 0.1667/n$,
$s = 0.408/\sqrt{n}$, instead of $0.5/\sqrt{n}$, the value of $s$ if the chances are $\tfrac{1}{2}$ in every
case. In most practical cases, however, the effect will be much less. Thus
the standard deviation of simple sampling for a death-rate of, say, 14 per
thousand in a population of uniform age and one sex is $(14 \times 986)^{1/2}/\sqrt{n}$
$= 118/\sqrt{n}$. In a population of the age composition of that of England
and Wales, however, the death-rate is not, of course, uniform, but varies
from a high value in infancy (say 64 per thousand), through very low
values (2 to 3 per thousand) in childhood to continuously increasing values
in old age: the standard deviation of the rate within such a population
is roughly about 24 per thousand. But the effect of this variation on the
standard deviation of simple sampling is quite small, for, as calculated from
equation (17.10),

$$s^2 = \frac{1}{n}(14 \times 986 - 576)$$

$$s = 115/\sqrt{n}$$

as compared with $118/\sqrt{n}$.
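
The three illustrations of 17.33 may be worked through with equation (17.10). The sketch below returns $n s^2 = p_0 q_0 - \sigma_p^2$, so that the factor $1/n$ is left aside; the function name is our own.

```python
import math

def s_squared_times_n(p0, sigma_p):
    """n * s^2 from equation (17.10): p0*q0 - sigma_p^2."""
    return p0 * (1.0 - p0) - sigma_p**2

# Limiting case: p is 0 for half the events and 1 for the rest.
print(s_squared_times_n(0.5, 0.5))

# p uniformly distributed between 0 and 1: sigma_p^2 = 1/12.
print(round(math.sqrt(s_squared_times_n(0.5, math.sqrt(1 / 12))), 3))

# Death-rate of 14 per thousand, working in units of one per thousand.
simple = math.sqrt(14 * 986)              # uniform chances
mixed = math.sqrt(14 * 986 - 24**2)       # chances varying with sd 24
print(round(simple, 1), round(mixed, 1))
```

The last line gives roughly 117.5 and 115.0, which the text quotes in round figures; the reduction due to the varying chances is seen to be about two per cent only.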

17.34 We have, finally, to pass to the condition referred to in 17.23,
and to discuss the effect of a certain amount of dependence between the
several "events" in each sample. We shall suppose, however, that the
two other conditions are fulfilled, the chances $p$ and $q$ being the same for
every event at every trial, and constant throughout the experiment. The
standard deviation for each event is $(pq)^{1/2}$ as before, but the events are no
longer independent; instead, therefore, of the simple expression

$$\sigma^2 = npq$$

we must have (cf. 14.2, p. 327)

$$\sigma^2 = npq + 2pq(r_{12} + r_{13} + \ldots + r_{23} + \ldots)$$

where $r_{12}$, $r_{13}$, etc. are the correlations between the results of the first and
second, first and third events, and so on: correlations for variables (number
of successes) which can only take the values 0 and 1, but may nevertheless
be treated as ordinary variables. There are $n(n-1)/2$ correlation
coefficients, and if, therefore, $r$ is the arithmetic mean of the correlations,
we may write

$$\sigma^2 = npq[1 + r(n-1)] \qquad (17.11)$$

The standard deviation of simple sampling will therefore be increased or
diminished according as the average correlation between the results of
the single events is positive or negative, and the effect may be considerable,
as $\sigma$ may be reduced to zero or increased to $n(pq)^{1/2}$. For the standard
deviation of the proportion of successes in each sample we have the
equation

$$s^2 = \frac{pq}{n}[1 + r(n-1)] \qquad (17.12)$$
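
The two limits mentioned above may be seen by evaluating (17.11) and (17.12) directly. In the sketch below the function names, the sample size of 100 and the chance 0.5 are hypothetical. The small guard against a negative bracket is our own addition, needed only because of rounding in floating-point arithmetic when $r$ is at its lower limit $-1/(n-1)$.

```python
import math

def _bracket(n, r):
    """The factor 1 + r*(n - 1), guarded against rounding just below zero."""
    return max(0.0, 1.0 + r * (n - 1))

def sd_number_correlated(n, p, r):
    """Equation (17.11): sqrt(n*p*q*[1 + r*(n - 1)])."""
    return math.sqrt(n * p * (1.0 - p) * _bracket(n, r))

def sd_proportion_correlated(n, p, r):
    """Equation (17.12): sqrt((p*q/n)*[1 + r*(n - 1)])."""
    return math.sqrt(p * (1.0 - p) / n * _bracket(n, r))

n, p = 100, 0.5   # hypothetical sample size and chance of success

print(sd_number_correlated(n, p, 0.0))            # independent events: sqrt(npq)
print(sd_number_correlated(n, p, 1.0))            # perfect correlation: n*sqrt(pq)
print(sd_number_correlated(n, p, -1 / (n - 1)))   # lower limit of r: sensibly zero
```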

17.35 It should be noted that, as the means and standard deviations 
for our variables are all identical, r is the correlation coefficient for a table 
formed by taking all possible pairs of results in the n events of each sample. 

It should also be noted that the case when r is positive covers the 
departure from the rules of simple sampling discussed in 17.30-17.31 ; 
for if we draw successive samples from different records, this introduces 
the positive correlation at once, even although the results of the events at 
each trial are quite independent of one another. Similarly, the case dis- 
cussed in 17.32-17.33 is covered by the case when r is negative ; for if 
the chances are not the same for every event at each trial, and the chance 
of success for some one event is above the average, the mean chance of 
success for the remainder must be below it. The present case is, however, 
best kept distinct from the other two, since a positive or negative correlation 
may arise for reasons quite different from those discussed in 17.30-17.33. 

17.36 As a simple illustration, consider the important case of sampling
from a limited population, e.g. of drawing $n$ balls in succession from the
whole number $w$ in a bag containing $pw$ white balls and $qw$ black balls.
On repeating such drawings a large number of times, we are evidently
equally likely to get a white ball or a black ball for the first, second or $n$th
ball of the sample; the correlation table formed from all possible pairs of
every sample will therefore tend in the long run to give just the same form
of distribution as the correlation table formed from all possible pairs of
the $w$ balls in the bag. But from 11.41, page 276, we know that the
correlation coefficient for this table is $-1/(w-1)$, whence

$$\sigma^2 = npq\left[1 - \frac{n-1}{w-1}\right] = npq\,\frac{w-n}{w-1}$$

If $n = 1$, we have the obviously correct result that $\sigma = (pq)^{1/2}$, as in drawing
from unlimited material; if, on the other hand, $n = w$, $\sigma$ becomes zero
as it should, and the formula is thus checked for simple cases. For drawing
2 balls out of 4, $\sigma$ becomes $0.816(npq)^{1/2}$; for drawing 5 balls out of
10, $0.745(npq)^{1/2}$; in the case of drawing half the balls out of a very large
number, it approximates to $(0.5npq)^{1/2}$, or $0.707(npq)^{1/2}$.

17.37 In the case of contagious or infectious diseases, or of certain
forms of accident that are apt, if fatal at all, to result in wholesale deaths,
$r$ is positive, and if $n$ be large (as it usually is in such cases), a very small
value of $r$ may easily lead to a very great increase in the observed standard
deviation. It is difficult to give a really good example from actual
statistics, as the conditions are hardly ever constant from one year to
another, but the following will serve to illustrate the point. During the
twenty years 1887-1906 there were 2,107 deaths from explosions of fire-
damp or coal-dust in the coal-mines of the United Kingdom, or an average
of 105 deaths per annum. From 17.15 it follows that this should be the
square of the standard deviation of simple sampling, or the standard
deviation itself approximately 10.3. But the square of the actual
standard deviation (the standard error) is 7,178, or its value 84.7, the
numbers of deaths ranging between 14 (in 1903) and 317 (in 1894). This
large standard deviation, to judge from the figures, is partly, though not
wholly, due to a general tendency to decrease in the numbers of deaths
from explosions in spite of a large increase in the number of persons
employed; but even if we ignore this, the magnitude of the standard
deviation can be accounted for by a very small value of the correlation $r$,
expressive of the fact that if an explosion is sufficiently serious to be fatal
to one individual, it will probably be fatal to others also. For if $\sigma_0$ denote
the standard deviation of simple sampling, $\sigma$ the standard deviation of
sampling given by equation (17.11), we have

$$r = \frac{\sigma^2 - \sigma_0^2}{(n-1)\,\sigma_0^2}$$

Whence, from the above data, taking the numbers of persons employed
underground at a rough average of 560,000,

$$r = \frac{7{,}073}{560{,}000 \times 105} = +0.00012$$

17.38 Summarising the preceding paragraphs, 17.30-17.37, we see that 
if the chances p and q differ for the various populations, districts, years, 
materials, or whatever they may be from which the samples are drawn,
the standard deviation observed (the standard error) will be greater than the
standard deviation of simple sampling, as calculated from the average values
of the chances; if the average chances are the same for each population
from which a sample is drawn, but vary from individual to individual or
from one sub-class to another within the population, the standard deviation
observed (the standard error) will be less than the standard deviation of
simple sampling as calculated from the mean values of the chances; finally,
if p and q are constant, but the events are no longer independent, the
observed standard deviation (the standard error) will be greater or less
than the simplest theoretical value according as the correlation between
the results of the single events is positive or negative. These conclusions
further emphasise the need for caution in the use of standard errors. If we
find that the standard deviation in some case of sampling exceeds the
standard deviation of simple sampling, two interpretations are possible:
either that p and q are different in the various populations from which
samples have been drawn (i.e. that the variations are more or less significant),
or that the results of the events are positively correlated inter se.
If the actual standard deviation fall short of the standard deviation of
simple sampling two interpretations are again possible: either that the
chances p and q vary for different individuals or sub-classes in each population,
while approximately constant from one population to another, or
that the results of the events are negatively correlated inter se. Even if 
the actual standard deviation approaches closely to the standard deviation 
of simple sampling, it is only a conjectural and not a necessary inference 
that all the conditions of “ simple sampling ’’ are fulfilled. Possibly, for 
example, there may be a positive correlation r between the results of the 
different events, masked by a variation of the chances p and q in sub- 
classes of each population. 

An alternative approach 

17.39 The results of this chapter have been studied from a rather different
point of view by a continental school of statisticians, among whose names
those of Lexis and Charlier are prominent.

Lexis considers a number of samples of $n$ individuals in which the
proportions of successes observed are $p_1$, $p_2$, . . . $p_N$, and sets himself
to investigate the nature of the population from which they were drawn:
whether it is homogeneous and the samples may be regarded as obtained
by simple sampling, whether it varies in time or place so that the samples
are not simple, and so on. He takes $p_0$ to be the mean of the observed
values $p_1$ . . . $p_N$, and writes

$$r = 0.67449\sqrt{\frac{p_0 q_0}{n}}$$

He then defines

$$R = 0.67449\sqrt{\frac{\Sigma(p - p_0)^2}{N - 1}}$$

where the summation extends over all values of $p_1$ . . . $p_N$, and writes

$$Q = \frac{R}{r}$$

17.40 Now, if the sampling is simple we may, in large samples, take
the mean $p_0$ to be an estimate of the true value, and $r$ to be an estimate of
the probable error of simple sampling of $p$. Also, we may take the quantity
$R$ to be an estimate of the probable error of $p$ (see 21.7).

Hence, for large samples, $R$ is approximately equal to $r$, and $Q = 1$.
This case, which is what we have called simple sampling, Lexis calls
"normal dispersion."

17.41 On the other hand, if the population is not constant while the
samples are drawn, or if they come from different parts of a patchy population,
we get the case discussed in 17.30. $R$ is no longer an estimate of the
probable error of a constant $p$, but may be split into two parts, one due to
the sampling fluctuations of the observed values of $p$ round the mean value,
the other due to the variations of the true values round that mean. $R$ will
therefore be greater than $r$, as may be seen from equation (17.8), and
$Q > 1$. This case Lexis calls "supernormal dispersion."

17.42 Similarly, in the case discussed in 17.32 we get $R$ less than $r$,
and hence $Q < 1$. This case Lexis calls "subnormal dispersion," and
speaks of the data which give rise to it as "constrained" (gebundene).

The quantity $Q$ is analogous to a quantity $\chi^2$, which we shall consider
at some length in Chapter 20 in discussing the significance of the deviations
of observed frequencies from theoretical expectation.
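
The ratio $Q$ of Lexis may be computed as in the following sketch. Since the constant 0.67449 enters both $R$ and $r$ it cancels, and $Q$ is simply the observed standard deviation of the proportions divided by the simple-sampling value. The divisor $N - 1$ used for the observed variance is an assumption on our part (for large $N$ the choice is immaterial), and the five proportions are hypothetical.

```python
import math

def lexis_ratio(proportions, n):
    """Return Q = R / r for samples of n individuals each.

    The constant 0.67449 appears in both R and r and cancels, so Q is
    the observed standard deviation of the proportions divided by the
    simple-sampling value sqrt(p0*q0/n).
    """
    count = len(proportions)
    p0 = sum(proportions) / count
    # Assumption: divisor count - 1 for the observed variance.
    observed = math.sqrt(sum((p - p0) ** 2 for p in proportions) / (count - 1))
    simple = math.sqrt(p0 * (1.0 - p0) / n)
    return observed / simple

# Hypothetical proportions of successes in five samples of 400 each.
proportions = [0.40, 0.45, 0.50, 0.55, 0.60]
q = lexis_ratio(proportions, 400)
print(round(q, 2))
print("supernormal dispersion" if q > 1 else "normal or subnormal dispersion")
```

With so few samples the value of $Q$ is itself subject to large fluctuations of sampling, and only a marked departure from unity would deserve attention.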

SUMMARY 

1. Under simple sampling conditions, the proportion of successes in a 
sample may be taken as an estimate of the proportion of successes in the 
parent population. 

2. If $p$ is the proportion of successes in the population, the standard error
of simple sampling of the number of successes is given by

$$\sigma = \sqrt{npq}$$

and of the proportion of successes by

$$s = \sqrt{\frac{pq}{n}}$$

3. The probability that an observed number of successes deviates from
the expected number by more than three times the standard error is very
small. This fact enables us to set limits to the range within which the
observed frequency lies when we know the theoretical frequency.

4. For large samples, the observed frequency of successes may be used
to calculate the standard error, and this fact enables us to set limits to
the range within which the theoretical frequency lies when we know the
observed frequency.

5. For several samples, if the chance of success varies from sample to
sample but remains constant within a sample, the standard error of the
number of successes is given by

$$\sigma^2 = n p_0 q_0 + n(n-1)\sigma_p^2$$

and of the proportion of successes by

$$s^2 = \frac{p_0 q_0}{n} + \frac{n-1}{n}\,\sigma_p^2$$

where $p_0$ is the mean of the varying chance of success, $\sigma_p$ is the standard
deviation of $p$, and $n$ is the number of individuals in each sample.

If $n$ is large and $s_0$ is the standard deviation calculated from the mean
$p_0$, this last equation is approximately

$$s^2 = s_0^2 + \sigma_p^2$$

6. If the chance of success varies between the individuals of a sample
but does not vary as between the different samples,

$$\sigma^2 = n p_0 q_0 - n\sigma_p^2$$

$$s^2 = \frac{p_0 q_0}{n} - \frac{\sigma_p^2}{n}$$

7. If the chance of success remains constant for each member of each
sample, but the events are not independent,

$$\sigma^2 = npq\{1 + r(n-1)\}$$

where $r$ is the mean of the correlations between the results of the events.


EXERCISES 

17.1 Compare the actual with the theoretical mean and standard deviation
for the following record of 6,500 throws of 12 dice, 4, 5 or 6 being reckoned
as a "success":

Successes   Frequency      Successes   Frequency
    0             1            7          1,351
    1            14            8            844
    2           103            9            391
    3           302           10            117
    4           711           11             21
    5         1,231           12              3
    6         1,411          Total        6,500


17.2 (Quetelet, "Lettres . . . sur la théorie des probabilités.")

Balls were drawn from a bag containing equal numbers of black and white
balls, each ball being returned before drawing another. The records were
then grouped by counting the number of black balls in consecutive 2's,
3's, 4's, 5's, etc. The following are the distributions so derived for
grouping by 5's, 6's, and 7's. Compare actual with theoretical means
and standard deviations.

Successes   (a) Grouping   (b) Grouping   (c) Grouping
              by fives       by sixes       by sevens
    0             30             17              9
    1            125             65             34
    2            277            166            104
    3            224            192            151
    4            136            166            148
    5             27             69             95
    6              -              8             40
    7              -              -              4
  Total          819            683            585


17.3 The proportion of successes in the data of Exercise 17.1 is 0.5097.
Find the standard deviation of the proportion with the given number of 
throws, and state whether you would regard the excess of successes as 
probably significant of bias in the dice. 

17.4 In the 4,096 drawings on which Exercise 17.2 is based 2,030 balls 
were black and 2,066 white. Is this divergence probably significant of 
bias ? 

17.5 (Data from Report I, Evolution Committee of the Royal Society, 
page 17.) In breeding certain stocks, 408 hairy and 126 glabrous plants 
were obtained. If the expectation is one-fourth glabrous, is the divergence 
significant, or might it have occurred as a fluctuation of sampling ? 

17.6 400 eggs are taken at random from a large consignment, and 50 are 
found to be bad. Estimate the percentage of bad eggs in the consignment 
and assign limits within which the percentage probably lies. 

17.7 In a certain association table (data from Exercise 2.5) the following 
frequencies were obtained — 

(AB) = 309, (Aβ) = 214, (αB) = 132, (αβ) = 119

Can the association of the table have arisen as a fluctuation of simple
sampling, the true association being zero?


17.8 The sex ratio at birth is sometimes given by the ratio of male to
female births, instead of the proportion of male to total births. If $Z$ is
the ratio, i.e. $Z = p/q$, show that the standard error of $Z$ is approximately

$$(1 + Z)\sqrt{\frac{Z}{n}}$$

$n$ being large, so that deviations are small compared with the mean.


17.9 In a random sample of 500 persons from town A, 200 are found
to be consumers of cheese. In a sample of 400 from town B, 200 are also
found to be consumers of cheese. Discuss the question whether the
data reveal a significant difference between A and B so far as the
proportion of cheese-consumers is concerned.

17.10 In a newspaper article of 1,600 words in English 36 per cent of 
the words are found to be of Anglo-Saxon origin. Assuming that simple 
sampling conditions hold, estimate the proportion of Anglo-Saxon words 
in the writer’s vocabulary and assign limits to that proportion. 

Suggest possible causes which might break down the three conditions 
for simple sampling. 

17.11 If a series of random samples of different sizes is taken from the
same material, show that the standard deviation of the observed
proportions of successes in such sets is s, where

\[ s^2 = \frac{pq}{H} \]

and H is the harmonic mean of the numbers in the samples.

17.12 Apply the result of the previous exercise to the following data 
(A. D. Darbishire, Biometrika, vol. 3, page 30), giving percentages to the 
nearest unit of albinos obtained in 121 litters from hybrids of Japanese 
waltzing mice by albinos, crossed inter se — 


Percentage   Frequency     Percentage   Frequency

     0          40             40           3
    14           4             43           2
    17           9             50          16
    20           9             57           1
    22           1             60           3
    25          10             67           4
    29           3             80           1
    33          13            100           2


Calculate the actual standard deviation and compare it with the result 
given by the formula of the previous exercise. The expected proportion 
of albinos is 25 per cent, and the sizes of the litters are given in Example 
5.5, page 121 

17.13 In a case of mice-breeding (see reference above) the harmonic mean
number in a litter was 4.735, and the expected proportion of albinos 50
per cent. Find the standard deviation of simple sampling for the
proportion of albinos in a litter, and state whether the actual standard
deviation (21.63 per cent) probably indicates any real variation, or not.

17.14 If for one half of n events the chance of success is p and the chance 
of failure q, whilst for the other half the chance of success is q and the 
chance of failure p, what is the standard deviation of the number of 
successes, the events being all independent ? 
17.15 Corresponding to the case of equation (17.8) show that if the values
of p are small so that the binomial tends to the Poisson limit with parameter
M, the variance of the numbers of successes observed is given by

\[ s^2 = \bar{M} + \sigma_M^2 \]

where M̄ is the mean value of M and σ_M is the standard deviation.

17.16 Similarly, corresponding to equation (17.10), show that

\[ s^2 = M \]

so that the usual equation for the standard error holds notwithstanding
departures from simple sampling of the type here considered (cf.
equation (17.3)).

17.17 The following are the deaths from smallpox during the twenty
years 1882-1901 in England and Wales —

Year    Deaths      Year    Deaths

1882    1,317       1892      431
1883      957       1893    1,457
1884    2,234       1894      820
1885    2,827       1895      223
1886      275       1896      541
1887      506       1897       25
1888    1,026       1898      253
1889       23       1899      174
1890       16       1900       85
1891       49       1901      356


The death-rate from smallpox being very small, the rule of 17.15 may
be applied to estimate the standard deviation of simple sampling.
Assuming that the excess of the actual standard deviation over this can be
entirely accounted for by a correlation between the results of exposure
to risk of the individuals composing the population, estimate r. The
mean population during the period may be taken in round numbers as
29 millions.



CHAPTER EIGHTEEN 


THE SAMPLING OF VARIABLES 

LARGE SAMPLES 


Sampling of variables 

18.1 We are now able to proceed from the sampling of attributes to 
the sampling of variables. Whereas in the last chapter we were interested 
in the question whether a member of a sample did or did not exhibit a 
particular attribute, we now have to study individuals which may take any 
of the values of a variable. It will no longer be possible, therefore, for us 
to classify each member of a sample under one of two heads, success or 
failure ; in general the values of the variate given by different trials will
be spread over a range, which may be unlimited, limited by practical 
considerations, as in the case of height in human beings, or limited by 
theoretical considerations, as in the case of the correlation coefficient, 
which cannot lie outside the range +1 to —1. 

18.2 To give concreteness to our discussions we shall occasionally find 
it useful to consider the sampling of variables as a kind of ticket sampling. 
We may picture our population as made up of tickets, each bearing a
recorded value of some variable X. Sampling may then be imagined to
consist of the drawing of tickets and the noting of the values of X which
they bear. In the great majority of cases with which we shall deal, X
may have any value over a continuous range, and the ticket population 
is to be conceived as being actually or practically infinite. 

18.3 As in the case of attributes, our principal objects in studying
these samples will be (a) to compare observation with expectation and to
see how far deviations of one from the other can be attributed to
fluctuations of sampling ; (b) to estimate from samples some characteristic
of the parent population, such as the mean of a variate ; and (c) to gauge
the reliability of our estimates.

In order to grasp satisfactorily the ideas and assumptions upon which
work of this kind is based, it is necessary to develop some theoretical
considerations which have already been touched upon in the last chapter.
This we now proceed to do.

Sampling distributions

18.4 If we take a number of samples from a population and calculate
some function,¹ such as the mean or the standard deviation, of each sample,
we shall in general get a series of different values, one for each sample. If
the number of samples is at all large, these values may be grouped in a
frequency distribution ; and as the number of samples becomes larger,
this distribution will approach the “ideal” form of a continuous curve.
Such a distribution is called a sampling distribution.
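The idea of a sampling distribution may be made concrete by a small computation. The following is a minimal sketch only: it assumes a normal parent with a hypothetical mean of 67.5 inches and the standard deviation of 2.57 inches quoted later for the height data, and it uses Python's pseudo-random generator in place of random sampling numbers. The function name and the seed are ours.

```python
import random
import statistics


def sampling_distribution_of_mean(n_samples, sample_size, mu, sigma, seed=0):
    """Return the means of n_samples random samples from a normal parent."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical parent: mean 67.5 in. (assumed), standard deviation 2.57 in.
means = sampling_distribution_of_mean(
    n_samples=2000, sample_size=10, mu=67.5, sigma=2.57
)

centre = statistics.fmean(means)   # should lie close to the parent mean
spread = statistics.stdev(means)   # should lie close to 2.57 / sqrt(10)
print(round(centre, 2), round(spread, 2))
```

Grouping the values of `means` into classes would give a frequency table of the same kind as the tables of sample values shown in this chapter.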


18.5 As an illustration, consider the population of 8,585 men, classified
according to height, of Table 4.7, page 82. In Chapter 16 we showed
how to draw a random sample of 10 individuals from this population,
and for one sample we calculated the mean. The following table shows
the 100 values of the sample mean obtained by taking 100 such samples
arranged in the form of a frequency table —

TABLE 18.1. — Frequency distribution of means of samples of 10 from the population
of the last column of Table 4.7, page 82

Value of mean in         Number of samples with
sample (inches)          specified values of
less ^ inch              the mean

    64.4-                        1
    64.8-                        —
    65.2-                        1
    65.6-                       11
    66.0-                       12
    66.4-                       16
    66.8-                       22
    67.2-                       18
    67.6-                       14
    68.0-                        4
    68.4-                        1

    Total                      100


This distribution is not very regular, owing to the smallness of the total 
frequency. 

18.6 As a second illustration we take some data obtained with random
sampling numbers from a bivariate normal population with correlation
+0.9. 500 samples of 10 were taken and the correlation coefficient
of each sample worked out. The frequency distribution of the 500 values
was as follows (data adapted from P. R. Rider, “Distribution of
Correlation Coefficient in Small Samples,” Biometrika, vol. 24, 1932, page 382) —


¹ Quantities such as means, standard deviations, moments, correlation coefficients
and so forth will be referred to generically as “parameters.” It is the modern practice
to reserve this word for a population value and to denote the corresponding sample
value by the word “statistic.” Thus a sample-mean is a statistic which forms the
estimate of a population-mean, the parameter.
TABLE 18.2. — Frequency distribution of correlation coefficients in samples of 10 from
a normal population

Value of r in sample     Frequency

   -0.1 to 0.0                2
    0.0 to 0.1                0
    0.1 to 0.2                0
    0.2 to 0.3                2
    0.3 to 0.4                4
    0.4 to 0.5                7
    0.5 to 0.6               30
    0.6 to 0.7               44
    0.7 to 0.8              102
    0.8 to 0.9              178
    0.9 to 1.0              131

    Total                   500


Here the distribution is more regular, the number of samples being five 
times as large. In general we expect that as the number of samples 
increases, the distribution will tend more and more to a continuous curve. 
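An experiment of this kind can be imitated on a computer. The sketch below is an illustration under stated assumptions, not a reproduction of Rider's work: it builds each pair from two independent standard normal deviates so that the parent correlation is 0.9, and the helper name `sample_r` and the seed are ours.

```python
import math
import random
import statistics


def sample_r(rng, n, rho):
    """Correlation coefficient of one sample of n pairs from a bivariate
    normal parent with correlation rho (zero means, unit variances)."""
    xs, ys = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(u)
        ys.append(rho * u + math.sqrt(1 - rho * rho) * v)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
rs = [sample_r(rng, 10, 0.9) for _ in range(500)]

# Group the 500 values into classes of width 0.1, labelled by lower limit.
table = {}
for r in rs:
    lower = math.floor(r * 10) / 10
    table[lower] = table.get(lower, 0) + 1
for lower in sorted(table):
    print(f"{lower:+.1f}  {table[lower]}")
```

The frequencies will differ from run to run with the seed, but the skew form, with a long tail towards low values of r, should be visible in each.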

Use of the sampling distribution 

18.7 Let us suppose that we are given the sampling distribution of a
statistic, and that the frequency (y) may be represented in terms of
the variate (x) by a continuous curve, y = f(x).

The frequency with which a given value x₀ of the statistic occurs in
a large number of samples will be represented by the ordinate of the
curve at the point whose abscissa is x₀. We have had an example of
this in the normal curve.

The number of samples which give a value of x greater than x₀ will be
represented by the area to the right of the ordinate at x₀ ; the number
giving a value less than x₀ will be represented by the remaining area to
the left.

Hence, the chance that any sample chosen at random from all possible
samples will give a value of x greater than x₀ is given by the area to the
right of the ordinate at x₀ divided by the total area of the curve, which
represents the total number of samples ; and the chance that the sample
will give a value of x less than x₀ is given by the area to the left of the
ordinate at x₀ divided by the total area.

Similarly, the chance that a sample would give a value of x lying
between, say, x₁ and x₂ is the area lying between the ordinates at the points
x₁ and x₂ divided by the total area.
18.8 In 8.21 we referred to the fact that areas could be expressed in
the notation of the integral calculus. In fact, we may write the area
of the curve between x₁ and x₂ as

\[ \int_{x_1}^{x_2} f(x)\,dx \]

and hence we may express P, the probability that a sample will give a
value between x₁ and x₂, as

\[ P = \int_{x_1}^{x_2} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx \]

where we assume the extreme limits to be ± ∞ as in the normal curve.
In particular, the probability that the sample will give a value of x greater
than x₀ is given by

\[ P = \int_{x_0}^{\infty} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx \]

As a rule, we can choose our units so that the area of the curve is unity. 
This simplifies the above expressions ; for the denominator, being equal 
to unity, may be omitted. 
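The ratio of areas can be evaluated numerically when no table is to hand. The following sketch assumes a curve of normal shape that has not been scaled to unit area, and replaces the infinite limits by ±10 standard deviations, beyond which the area is negligible. The trapezoidal rule and the function names are our own choices for illustration.

```python
import math


def area(f, a, b, steps=20000):
    """Trapezoidal approximation to the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


def probability_between(f, x1, x2, lower=-10.0, upper=10.0):
    """Area between x1 and x2 divided by the total area of the curve."""
    return area(f, x1, x2) / area(f, lower, upper)


def curve(x):
    """A normal-shaped curve with unit standard deviation, not scaled to unit area."""
    return math.exp(-0.5 * x * x)


# Chance of a value more than 1.5 standard deviations above the mean.
p = probability_between(curve, 1.5, 10.0)
print(round(p, 4))
```

Because the total area appears in the denominator, the result does not depend on the scale chosen for the ordinates.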

18.9 Now let us suppose that, knowing the form of the sampling
distribution and hence being able to calculate P for any given x₀, we take a
sample and find that it gives a very low value of P. We are then faced
with three possibilities : either a very improbable event has occurred ;
or the assumptions on which we obtained the sampling distribution were
incorrect ; or there is something wrong with our sampling technique.
Which of these explanations we adopt is to some extent a matter of choice,
but if we have tested our sampling, or on other grounds have no reason
to suspect it, we shall, as a rule, be led to query the hypotheses on which
the sampling distribution was obtained.

This, in effect, is what we did in the previous chapter. It so happens
that in the simple sampling of attributes we know that the exact form
of the sampling distribution is N(q+p)ⁿ, where p is the chance of success.
Without examining this distribution too closely we can say that only a
very small part of it lies outside the range ±3σ. Hence, if we find a
sample giving a value outside the range ±3√(npq), we suspect the hypothesis
on which the distribution was based ; and this, unless we prefer to suppose
that our sampling was not in fact simple, leads us to suspect the value of
p, which completely determines the sampling distribution.

18.10 In the previous chapter we regarded the probability of a sample
giving a value differing by more than 3σ from the mean value as so remote
that in every case we should be justified in looking for some definite
cause of the discrepancy. This is only a conventional range, based upon
the empirical fact that in most single-humped populations it includes
nearly all the members ; but it is a convenient one to take and we shall
use it again below. For certain purposes, however, we might be prepared
to use a narrower range which, though not giving such a small probability
that a sample lay outside it, yet indicated considerable improbability in
the divergence of observation from expectation, and enabled us to criticise
the validity of our hypotheses with some degree of assurance. We give
one or two examples below.

18.11 In practice nearly all the sampling distributions we have to
consider are based on simple sampling. It is therefore convenient to
speak briefly of a “sampling distribution,” meaning thereby a sampling
distribution obtained under simple (and random) conditions.

Example 18.1. — The sampling distribution of a statistic is a normal
population with mean 9 units and standard deviation 2 units. What is
the probability that a sample will give a value of the statistic greater
than 12 units ?

Here the value 12 is three units, i.e. 1.5σ, to the right of the mean.
The required probability is therefore the area of the normal curve to the
right of an ordinate 1.5σ to the right of the mean, divided by the total
area of the curve.

This ratio can be obtained at once from Table 2 of the Appendix.
We see, in fact, that the greater fraction of the area of the curve
corresponding to x/σ = 1.5 is 0.9332. The smaller fraction is therefore 0.0668,
which gives us the required probability.
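The same figure may be checked without tables. This is a minimal sketch using the `NormalDist` class of Python's standard `statistics` module in place of the Appendix table; the variable names are ours.

```python
from statistics import NormalDist

# Sampling distribution of the statistic: normal, mean 9, standard deviation 2.
sampling_dist = NormalDist(mu=9, sigma=2)

greater_fraction = sampling_dist.cdf(12)   # area to the left of 12
p_exceeds_12 = 1 - greater_fraction        # area to the right of 12

print(round(greater_fraction, 4), round(p_exceeds_12, 4))
```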

Example 18.2. — If the sampling distribution of a statistic is normal,
with zero mean and standard deviation σ, what is the value of the
statistic such that the chances are 99 to 1 against a sample giving a value
in excess of that value ?

We have to find x such that the area of the curve to the right of the
ordinate at x is 0.01, or the area to the left 0.99.

From Appendix Table 2 —

If x/σ = 2.32, greater fraction of area = 0.9898
and if x/σ = 2.33, greater fraction of area = 0.9901

Hence, to the nearest second place of decimals the required value is 2.33σ.
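The inverse reading of the table can also be done directly. The sketch below assumes unit standard deviation, so that the answer is expressed as a multiple of σ, and uses the `inv_cdf` method of `NormalDist` from the standard library.

```python
from statistics import NormalDist

unit_normal = NormalDist()             # zero mean, unit standard deviation

# Value with 0.99 of the area to its left and 0.01 to its right.
x_over_sigma = unit_normal.inv_cdf(0.99)

print(round(x_over_sigma, 2))
# The two tabular entries used in the example, for comparison.
print(round(unit_normal.cdf(2.32), 4), round(unit_normal.cdf(2.33), 4))
```

The value computed in this way is slightly below 2.33, which agrees with the tabular entries: 0.99 lies between the areas for 2.32 and 2.33 and much nearer the latter.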

Example 18.3. — It very frequently happens in sampling inquiries
that we are interested in the probability that a sample value exceeds a
given value x₀ in absolute value, i.e. that it is greater than x₀ or less
than −x₀. We can ascertain this probability without much trouble from the
ordinary table of areas of the normal curve if the distribution is normal.
Consider, for instance, the data of Example 18.1. Here we found the
probability that a sample would give a value greater than 1.5σ. If we
want the probability that it would give a value greater than 1.5σ in
absolute value, we have —

P = Area to right of ordinate at 1.5σ
+ Area to left of ordinate at −1.5σ

Since the curve is symmetrical, the two areas in question are equal, and

P = 2(1 − 0.9332)
  = 0.1336
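For deviations in absolute value a small helper is convenient. The function below is our own illustration, assuming a normal sampling distribution; `k` is the deviation expressed as a multiple of the standard deviation.

```python
from statistics import NormalDist


def two_sided_tail(k):
    """Probability that a normal variate deviates from its mean by more
    than k standard deviations in either direction."""
    upper_tail = 1 - NormalDist().cdf(k)
    return 2 * upper_tail   # the curve is symmetrical, so the tails are equal


p_abs = two_sided_tail(1.5)
print(round(p_abs, 4))
```

The same function gives the probability attached to any other conventional range, for instance `two_sided_tail(3)` for the range of three times the standard deviation.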

18.12 To apply the results of 18.7 to 18.11 in practice for the purpose 
of discussing the population from which the samples came, we require to 
know two things : (a) What is the relation between the sampling dis- 
tribution and the parent distribution, and (b) what is the form, at least 
approximately, of the sampling distribution of a given statistic from a 
given population ? 

18.13 If the sampling is to be of much use in enabling us to estimate 
the value of a parameter in the parent, we should expect most of our 
estimates to be somewhere near the mark, and only comparatively few to 
be very far from the true value of the quantity estimated ; and further, we 
expect that, in general, the further the estimates are from the truth the 
fewer there will be of them. 

To put this more formally, we expect that the sampling distribution 
will have a peak somewhere close to the value of the parameter which 
corresponds to the true value in the parent. If it does not, the distribution 
is probably biased and our samples are likely to be misleading. 

The first desideratum in our sampling is, therefore, that it shall not lead 
to a biased distribution. We have seen in Chapter 16 the difficulties of 
eliminating bias in the sampling process itself. Where, therefore, the more 
practical considerations alluded to in that chapter impose no limitation, 
we must use unbiased sampling ; and this means that our sampling must
be random. In this connection it must be remembered that we cannot
judge from the samples themselves whether the sampling is random or not,
though we may suspect it. Separate tests, or the use of some accredited 
method, are to be recommended where practicable. 

18.14 Knowledge of the form of the sampling distribution of a statistic,
even of an approximate kind, is by no means easy to secure. We saw that
in the case of the simple sampling of attributes it was possible to deduce
the sampling distribution in an exact form. We are not always in this
fortunate position here ; in fact, rarely so. The principal difficulties
are —

(a) The form of the parent population frequently is unknown.

(b) Even if the form of the parent is known, certain of its constants may
be unknown ; for instance, we may know that a population is normal but
be ignorant of its mean and standard deviation.

(c) If the parent is completely known, the form of the sampling dis- 
tribution can be deduced theoretically in certain circumstances, and in 
particular if the sampling is simple ; but in practice the mathematical 
problems which arise usually are very complex, and even if they are 
tractable may be of no use owing to the enormous arithmetical labour 
involved in expressing a solution in serviceable form. 

18.15 If the samples are small these difficulties are formidable, even 
for simple sampling. With large samples, however, we are able to make 
certain legitimate approximations and assumptions which greatly simplify 
the problem. For the rest of this chapter and in the next we shall be 
concerned solely with large samples. 

Simple sampling of variables 

18.16 We shall also be thinking mainly in terms of simple sampling
(17.3). It is unnecessary to recapitulate here the discussion of simple
sampling which we gave in the previous chapter. The assumptions which
we considered in 17.19 to 17.24 apply mutatis mutandis to the simple
sampling of variables.

(a) We assume that we are drawing from precisely the same record
during the whole of the sampling ; if we picture our parent population
as a card population, the chance of drawing a card with any given value
X is the same for each sample.

(b) We assume not only that we are drawing from the same record
throughout, but that each of our cards at each drawing may be regarded
quite strictly as drawn from the same record (or from identically similar
records) : e.g. if our card record is contained in a series of bundles, we must
not make it a practice to take the first card from bundle number 1, the
second card from bundle number 2, and so on, or else the chance of drawing
a card with a given value of X, or a value within assigned limits, may not
be the same for each individual card at each drawing.

(c) We assume that the drawing of each card is entirely independent
of that of every other, so that the value of X recorded on card 1, at each
drawing, is uncorrelated with the value of X recorded on card 2, 3, 4, and
so on. It is for this reason that we spoke of the record, in 18.2, as
containing a practically infinite number of cards, for otherwise the successive
drawings at each sampling would not be independent ; if the bag contains
ten tickets only, bearing the numbers 1 to 10, and we draw the card bearing
1, the average of the following cards drawn will be higher than the mean of
all cards drawn ; if, on the other hand, we draw the 10, the average of the
following cards will be lower than the mean of all cards — i.e. there will be
a negative correlation between the number on the card taken at any one
drawing and the card taken at any other drawing. Without making the
number of cards in the bag indefinitely large, we can, as already pointed out
for the case of attributes, eliminate this correlation by replacing each card
before drawing the next.
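The negative correlation in the bag of ten tickets can be worked out exactly by enumerating every possible pair of successive drawings. The following sketch does so with and without replacement; the helper name `correlation` is ours.

```python
import math
from itertools import permutations, product


def correlation(pairs):
    """Product-moment correlation over a list of equally likely (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


tickets = range(1, 11)

# First and second drawing when the first ticket is NOT returned to the bag.
r_without = correlation(list(permutations(tickets, 2)))

# First and second drawing when the first ticket IS returned before the next.
r_with = correlation(list(product(tickets, repeat=2)))

print(round(r_without, 4), round(r_with, 4))
```

Without replacement the correlation is −1/9, that is −1/(N − 1) for a bag of N = 10 tickets, so it dies away as the bag is made larger; with replacement it vanishes whatever the size of the bag.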

Approximations in the theory of large samples 

18.17 We can now consider the approximations which are possible in
the theory of large samples.

In the first place, since we have supposed bias to be eliminated, the
sample values of a statistic will be grouped about the true value, and
if the samples are large, will differ by comparatively small quantities
from that value. Hence, we may take a sample value as an estimate
of the true value. That is to say, if we have a large sample (which may
consist of a number of samples run together), we may calculate the
parameter from it precisely as we should proceed if we were calculating the
parameter for the population as a whole, and take that value as our
estimate. Thus, the mean of the sample may be taken as an estimate
of the mean of the population.

18.18 This rule is not quite so obvious as it appears. Suppose, for
example, that we are estimating the standard deviation of a population.
In accordance with the previous paragraph we should take the standard
deviation of the sample. But in calculating this quantity we should have
to use deviations, not from the true mean, but from the mean in the sample,
which may differ from the true mean and to that extent affect the value
of the estimate. We shall, in fact, see later that if x₁, x₂, ... xₙ are the
values in the sample and x̄ their mean, there are reasons for preferring
the estimate s² = S(x − x̄)²/(n − 1) to the estimate s² = S(x − x̄)²/n for the
variance. If n is large, however, the difference is unimportant ; we can
ignore it until we come to deal with small samples.
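The two estimates of the variance differ only in the divisor, and the ratio between them is n/(n − 1). The sketch below illustrates this on a small set of hypothetical heights invented for the purpose; `statistics.pvariance` divides by n and `statistics.variance` by n − 1.

```python
import statistics

# Hypothetical sample of eight heights in inches (invented for illustration).
sample = [66.2, 68.1, 67.4, 65.9, 69.3, 67.0, 66.5, 68.8]
n = len(sample)

divisor_n = statistics.pvariance(sample)          # S(x - mean)^2 / n
divisor_n_minus_1 = statistics.variance(sample)   # S(x - mean)^2 / (n - 1)

print(round(divisor_n, 4), round(divisor_n_minus_1, 4))
print(round(divisor_n_minus_1 / divisor_n, 4))    # equals n / (n - 1)

# For large n the ratio n / (n - 1) is very near unity.
for size in (10, 100, 1000):
    print(size, round(size / (size - 1), 4))
```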

18.19 Secondly, as in the case of attributes, we can use these estimates
in calculating the constants of the sampling distribution, since they
differ only by small quantities from the real values. We saw, for instance,
that we were justified in taking the value of p in a large sample in
calculating the standard deviation √(npq) of the sampling distribution.
We shall find that the standard deviation of the sampling distribution of
the mean of samples from a normal population involves the standard
deviation of the parent ; and in this case we can evaluate that quantity
by using the standard deviation of the sample in place of the unknown
standard deviation of the parent.
18.20 Finally, it is a very remarkable fact that the sampling distributions
of many statistics, obtained under simple sampling conditions, tend
for large samples to a single-humped form either exactly or very closely
normal. The evidence for this statement is partly theoretical, partly
experimental. It may be shown that, for simple samples from a normal
population, the sampling distributions of most statistics are exactly
normal for large samples ; some, in fact, are normal for small samples.
Following up this work, a number of experiments has been carried out on
populations which are not normal ; and it appears that the parent can
deviate quite markedly from the normal form without affecting the
normality of the sampling distribution to any great extent provided, as before,
that the samples are large.

In most of our work we shall not require to assume that the sampling
distribution is normal. It will be sufficient to assume that a range of 3σ
on each side of the mean includes the major portion of the distribution,
and we can confidently take this to be so unless the parent exhibits very
marked skewness.
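An experiment of the kind referred to is easily imitated. The following sketch is an illustration under our own assumptions: the parent is taken to be an exponential distribution with unit mean and unit standard deviation, which is decidedly skew, and 2,000 samples of 100 are drawn with a pseudo-random generator.

```python
import math
import random
import statistics

rng = random.Random(2)       # fixed seed so the run can be repeated
sample_size = 100
n_samples = 2000

# Means of samples from a skew parent (exponential, mean 1, s.d. 1).
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(n_samples)
]

standard_error = 1.0 / math.sqrt(sample_size)   # parent s.d. / sqrt(n)

# Proportion of sample means within 3 standard errors of the parent mean.
inside = sum(abs(m - 1.0) <= 3 * standard_error for m in means) / n_samples
print(round(inside, 4))
```

In spite of the skewness of the parent, very nearly all the sample means should fall inside the range of three times the standard error on each side of the true mean.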

18.21 It will now be apparent that the difficulties we specified in 18.14 
have to a great extent been met. Provided that we know the parent 
distribution to be not unduly skew, we need not know its exact form ; 
and the sampling distribution can be represented satisfactorily, if not 
exactly specified, by a mean and standard deviation which may be 
estimated from the data of the sample. 

Standard error 

18.22 As in the last chapter, we shall refer to the standard deviation 
of the sampling distribution as the standard error. In most cases we 
shall be dealing with simple sampling distributions, but it is convenient
to use the term in this wider sense, although the word “ error ” is not 
altogether appropriate in some instances. In general, as we have seen, 
we are justified in taking a range of ± 3 times the standard error as deter- 
mining limits outside which the value of the parameter given by a sample 
probably does not lie. We can therefore use the standard error, as we 
have already used it for attributes, to gauge the precision of an estimate 
or to permit a judgment being made of the divergence between expected 
and observed values. 

In the remainder of this chapter, and in the next, we shall therefore 
be concerned mainly in finding expressions for the standard errors of 
the various parameters which we have to estimate. Their use we shall 
illustrate in examples as we go along. In certain cases we shall also 
consider the effect of a breakdown in the conditions of simple sampling. 

Standard error of a quantile, quartile and median

18.23 Let us first of all consider the case of quantiles, which is intimately
related to that of attributes.
Consider the distribution of a variate X in an indefinitely large sample.
(This is not necessarily the same as the distribution in the parent, owing
to the possible presence of bias ; but if bias is excluded, and the sampling
is simple, it is the same as the parent form.)

Let Xₚ be a value of X such that pN values of X in this distribution
lie above it and qN below it. Thus, if the sampling is unbiased, p = 1/10
would give us the upper decile in the indefinitely large sample, p = 1/2 the
median, and so on.

A sample of n will contain various values of X. Let the proportion
of values above Xₚ be p + δ ; and let ε be the adjustment to be made in
Xₚ so that the proportion of values of X above Xₚ + ε is p. The values
δ and ε may be regarded as sampling fluctuations.

Considering now the sample of n, we have that

the proportion of values above Xₚ     = p + δ
the proportion of values above Xₚ + ε = p

Hence,

δ = proportion of values between Xₚ and Xₚ + ε

Now if n be large, the proportion of values between Xₚ and Xₚ + ε in
the sample will, to a close approximation, be the proportion of values
between those quantities in the distribution of an indefinitely large
sample. Consider then this distribution and let the standard deviation
of X in it be σ. If we take the distribution as drawn to scale with unit
standard deviation and unit area, the proportion of values between Xₚ
and Xₚ + ε is the area of the curve between ordinates at the points Xₚ/σ
and (Xₚ + ε)/σ.

Now if n be large, ε will be small, for the value of a parameter in the
sample of n will lie close to the value in the indefinitely large sample.
Hence the area between Xₚ/σ and (Xₚ + ε)/σ is approximately rectangular, and
if we call the ordinate at Xₚ/σ yₚ, the area will be yₚ × ε/σ.

Hence,

\[ \delta = \frac{y_p \, \epsilon}{\sigma} \]

Now δ is the deviation of the observed proportion from the value p ;
and from our study of attributes we know that the observed proportion
p + δ will centre round the mean p with standard deviation √(pq/n).
Hence δ centres round zero mean with standard deviation √(pq/n). Since
ε bears a constant ratio σ/yₚ to δ, it follows that ε will be distributed about
zero mean with standard deviation

\[ \sqrt{\operatorname{var}(x_p)} = \sigma_{x_p} = \frac{\sigma}{y_p}\sqrt{\frac{pq}{n}} \qquad (18.1) \]


18.24 If the distribution in an indefinitely large sample be normal,
we can take the values of yₚ from the tables of the ordinate of the normal
curve (Appendix Table 1). From tables carried to further places of
decimals we have, for the various values of p which correspond to the
deciles,

                        Value of yₚ
Median                  0.3989423
Deciles 4 and 6         0.3863425
Deciles 3 and 7         0.3476926
Deciles 2 and 8         0.2799619
Deciles 1 and 9         0.1754983
Quartiles               0.3177766

Inserting these values of yₚ in equation (18.1), we have the following
values for the standard errors of the median, deciles, etc. —

                        Standard error is
                        σ/√n multiplied by
Median                  1.25331
Deciles 4 and 6         1.26804
Deciles 3 and 7         1.31800
Deciles 2 and 8         1.42877
Deciles 1 and 9         1.70942
Quartiles               1.36263

It will be seen that the influence of fluctuations of sampling on the 
several quantiles increases as we depart from the median : the standard 
error of the quartiles is nearly one-tenth greater than that of the median, 
and the standard error of the first or ninth decile more than one-third 
greater. 
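The multipliers of σ/√n may be recomputed from the rule that the standard error of a quantile is (σ/yₚ)√(pq/n), so that the multiplier is √(pq)/yₚ. The sketch below assumes a normal distribution and takes the ordinate from the standard library rather than from Appendix Table 1; the function name is ours.

```python
import math
from statistics import NormalDist


def quantile_se_multiplier(p):
    """Factor by which sigma / sqrt(n) is multiplied to give the standard
    error of the quantile with a proportion p of the values above it,
    assuming the distribution to be normal."""
    unit_normal = NormalDist()
    abscissa = unit_normal.inv_cdf(1 - p)   # point with proportion p above it
    y_p = unit_normal.pdf(abscissa)         # ordinate at that point
    return math.sqrt(p * (1 - p)) / y_p


for label, p in [("Median", 0.5), ("Deciles 4 and 6", 0.4),
                 ("Deciles 3 and 7", 0.3), ("Deciles 2 and 8", 0.2),
                 ("Deciles 1 and 9", 0.1), ("Quartiles", 0.25)]:
    print(f"{label:<18}{quantile_se_multiplier(p):.5f}")
```

Since both pq and the normal ordinate are symmetrical about the median, the same multiplier serves for the pair of quantiles at p and 1 − p.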

18.25 Consider further the influence of the form of the frequency-
distribution on the standard error of the median, as this is an important
form of average. For a distribution with a given number of observations
and a given standard deviation the standard error varies inversely as yₚ.
Hence for a distribution in which yₚ is small, for example a U-shaped
distribution, the standard error of the median will be relatively high, and
it will, in so far, be an undesirable form of average to employ. On the
other hand, in the case of a distribution which has a high peak in the
centre, so as to exhibit a value of yₚ large compared with the standard
deviation, the standard error of the median will be relatively low. We
can create such a “peaked” distribution by superposing a normal curve
with a small standard deviation on a normal curve with the same mean
and a relatively large standard deviation. To give some idea of the
reduction in the standard error of the median that may be effected by a
moderate change in the form of the distribution, let us find for what
ratio of the standard deviations of two such curves, having the same area,
the standard error of the median reduces to σ/√n, where σ is of course
the standard deviation of the compound distribution.

Let Oj, aj be the standard deviations of the two distributions, and let 
there be n /2 observations in each. Then 






(18.2) 


On the other hand, the value of yp is 


1-C+— ui /' 

l2V2wai 2V2n(j2iy 


CT,*+cr,* 


Hence, the standard error of the median is 


n Cl +CTj 


(18.4) is equal to a fVn if 

2 V ffOjcr, 

and writing u*/<ri=p, that is if 

il±p)yJ±E'^i 

2Vnp 


p«+2/)*-f(2— 4»r)/)*-(-2p-f-l *=0 

This equation may be reduced to a quadratic and solved by taking
$\rho + 1/\rho$ as a new variable. The roots found give $\rho = 2.2360\ldots$ or
$0.4472\ldots$, the one root being merely the reciprocal of the other. The
standard error of the median will therefore be $\sigma/\sqrt{n}$, in such a compound
distribution, if the standard deviation of the one normal curve is, in round
numbers, about 2¼ times that of the other. If the ratio be greater, the
standard error of the median will be less than $\sigma/\sqrt{n}$. The distribution
for which the standard error of the median is exactly equal to $\sigma/\sqrt{n}$ is
shown in fig. 18.1; it will be seen that it is by no means a very striking
form of distribution; at a hasty glance it might almost be taken as normal.
In the case of distributions of a form more or less similar to that shown,
it is evident that we cannot at all safely estimate by eye alone the relative
standard error of the median as compared with $\sigma/\sqrt{n}$.
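
The roots quoted above can be checked numerically. The following is a minimal sketch and not part of the original text; the function names are hypothetical. It solves the quadratic in $u = \rho + 1/\rho$ (namely $u^2 + 2u - 4\pi = 0$), recovers $\rho$, and confirms that at this ratio the standard error of the median from (18.4) equals $\sigma/\sqrt{n}$.

```python
import math

def critical_ratio():
    """Ratio rho = sigma2/sigma1 (> 1) at which se(median) = sigma/sqrt(n)."""
    # With u = rho + 1/rho the quartic reduces to u^2 + 2u - 4*pi = 0.
    u = -1.0 + math.sqrt(1.0 + 4.0 * math.pi)
    # rho is then a root of rho^2 - u*rho + 1 = 0; take the larger root.
    return (u + math.sqrt(u * u - 4.0)) / 2.0

def se_ratio(rho):
    """se(median) divided by sigma/sqrt(n) for two equal-area normal curves."""
    s1, s2 = 1.0, rho
    sigma = math.sqrt((s1 ** 2 + s2 ** 2) / 2.0)                  # (18.2)
    se_median = math.sqrt(2.0 * math.pi) * s1 * s2 / (s1 + s2)    # (18.4) times sqrt(n)
    return se_median / sigma

rho = critical_ratio()
print(round(rho, 4), round(1.0 / rho, 4))   # the two reciprocal roots
print(round(se_ratio(rho), 6))              # ratio of the two standard errors
```

Calling `se_ratio` with a larger value of `rho` gives a result below unity, in line with the remark that a greater ratio makes the median the more stable.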


18.26 In the case of a grouped frequency-distribution in which the
number of observations is large enough to give a fairly smooth distribution,
we can use an alternative form which does not involve a knowledge of the
standard deviation of the distribution in a very large sample. In fact, in
such a case the sample itself is large enough to give us a satisfactory
approximation to the distribution in an indefinitely large sample. Let $f_p$
be the frequency per class-interval at the given percentile (simple
interpolation will give us the value with quite sufficient accuracy for
practical purposes, and if the figures run irregularly they may be smoothed).
Let $\sigma$ be the value of the standard deviation expressed in class-intervals,
and let $n$ be the number of observations as before. Then, since $y_p$ is the
ordinate of the frequency-distribution when drawn with unit standard
deviation and unit area, we must have

$$y_p = \frac{\sigma}{n}\,f_p$$

But this gives at once for the standard error expressed in terms of the
class-interval as unit

$$\sigma_{x_p} = \frac{\sqrt{npq}}{f_p} \tag{18.5}$$


Example 18.4. Consider the data of Table 4.7, page 82, giving the
distribution of 8,585 men according to height. Let us take these data to
be a sample from the population of men in the United Kingdom at that
time. The number of observations is 8,585, and the standard deviation
2.57 in., the distribution being approximately normal: $\sigma/\sqrt{n} = 0.027737$,
and, multiplying by the factor 1.253... given in the table in 18.24, this
gives 0.0348 as the standard error of the median, on the assumption of
normality of the distribution.

Using the direct method of equation (18.5), we find the median to be
67.47 (5.20), which is very nearly at the centre of the interval with a
frequency 1,329. Taking this as being, with sufficient accuracy for our
present purpose, the frequency per interval at the median, the standard
error is

$$\frac{\tfrac{1}{2}\sqrt{8585}}{1329} = 0.0349$$

As we should expect, the value is practically the same as that obtained
from the value of the standard deviation on the assumption of normality.

Three times the standard error is 0.1047, and we accordingly conclude
that the median in the population lies within about 0.1 inch of 67.47, the
sample value, provided that the sampling is simple.


Example 18.5. Let us find the standard error of the first and ninth
deciles as another illustration. On the assumption that the distribution
is normal, these standard errors are the same, and equal to
$0.027737 \times 1.70942 = 0.0474$. Using the direct method, we find by simple
interpolation the approximate frequencies per interval at the first and ninth
deciles respectively to be 590 and 570, giving standard errors of 0.0471
and 0.0488, mean 0.0479, slightly in excess of that found on the assumption
that the frequency is given by the normal curve. The student should
notice that the class-interval is, in this case, identical with the unit of
measurement, and consequently the answer given by equation (18.5) does
not require to be multiplied by the magnitude of the interval.
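
The arithmetic of Examples 18.4 and 18.5 can be reproduced with a few lines of code. This is only an illustrative sketch of equation (18.5); the function name and the `interval` parameter are assumptions introduced here, the latter to cover the case where the class-interval differs from the unit of measurement.

```python
import math

def quantile_se(n, p, f_p, interval=1.0):
    """Standard error of a quantile by equation (18.5).

    n: number of observations; p: proportion on one side of the quantile;
    f_p: frequency per class-interval at the quantile;
    interval: width of the class-interval in the unit of measurement.
    """
    q = 1.0 - p
    return interval * math.sqrt(n * p * q) / f_p

n = 8585
se_median = quantile_se(n, 0.5, 1329)    # Example 18.4
se_decile_1 = quantile_se(n, 0.1, 590)   # Example 18.5, first decile
se_decile_9 = quantile_se(n, 0.9, 570)   # Example 18.5, ninth decile
print(f"{se_median:.4f} {se_decile_1:.4f} {se_decile_9:.4f}")
```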


Correlation between errors of quantiles 

18.27 In finding the standard error of the difference between two quantiles
in the same distribution, the student must be careful to note that the
errors in two such quantiles are not independent. Consider the two
quantiles for which the values of $p$ and $q$ are $p_1$, $q_1$; $p_2$, $q_2$ respectively,
the first named being the lower of the two quantiles. These two quantiles
divide the whole area of the frequency curve into three parts, the areas
of which are proportional to $q_1$, $1 - q_1 - p_2$, and $p_2$. Further, since the
errors in the first quantile are directly proportional to the errors in $q_1$,
and the errors in the second quantile are directly proportional but of
opposite sign to the errors in $p_2$, the correlation between errors in the two
quantiles will be the same as the correlation between errors in $q_1$ and $p_2$,
but of opposite sign. But if there be a deficiency of observations below the
lower quantile, producing an error $\delta_1$ in $q_1$, the missing observations will
tend to be spread over the two other sections of the curve in proportion to
their respective areas,¹ and will therefore tend to produce an error

$$\delta_2 = -\frac{p_2}{p_1}\,\delta_1$$

in $p_2$. If, then, $r$ be the correlation between errors in $q_1$ and $p_2$, and
$\epsilon_1$, $\epsilon_2$ the respective standard errors, we have

$$r\,\frac{\epsilon_2}{\epsilon_1} = -\frac{p_2}{p_1}$$

Or, inserting the values of the standard errors,

$$r = -\sqrt{\frac{p_2 q_1}{q_2 p_1}}$$

The correlation between the quantiles is the same in magnitude but
opposite in sign; it is obviously positive, and consequently

$$\text{Correlation between errors in two quantiles} = +\sqrt{\frac{p_2 q_1}{q_2 p_1}} \tag{18.6}$$

If the two quantiles approach very close together, $q_1$ and $q_2$, $p_1$ and $p_2$
become sensibly equal to one another, and the correlation becomes unity,
as we should expect. An alternative derivation is suggested in 19.3.


Standard error of semi-interquartile range

18.28 Let us apply the above value of the correlation between quantiles
to find the standard error of the semi-interquartile range for the normal
curve. Inserting $q_1 = p_2 = \tfrac{1}{4}$, $q_2 = p_1 = \tfrac{3}{4}$, we find $r = \tfrac{1}{3}$. Hence the standard
error of the interquartile range is, applying the ordinary formula for the
standard deviation of a difference, $2/\sqrt{3}$ times the standard error of
either quartile, or the standard error of the semi-interquartile range
$1/\sqrt{3}$ times the standard error of a quartile. Taking the value of the
standard error of a quartile from the table in 18.24, we have, finally,

$$\text{Standard error of the semi-interquartile range in a normal distribution} = 0.78672\,\frac{\sigma}{\sqrt{n}} \tag{18.7}$$


Of course the standard deviation of the interquartile, or semi-interquartile,
range can readily be worked out in any particular case, using
equation (18.5) and the value of the correlation given above; it is best to
work out such standard errors from first principles, applying the usual
formula for the standard deviation of the difference of two correlated
variables (14.2).
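
As an illustration of working from first principles, the sketch below combines the correlation (18.6) with the formula for the standard deviation of a difference of two correlated variables. It is not from the original text: the function names are hypothetical, and the standard error of a normal quartile is computed from the normal ordinate with Python's `statistics.NormalDist` rather than read from the table in 18.24.

```python
import math
from statistics import NormalDist

def quantile_error_correlation(q1, q2):
    """Correlation (18.6) between errors in two quantiles.

    q1, q2: proportions of the distribution below the lower and upper quantile.
    """
    p1, p2 = 1.0 - q1, 1.0 - q2
    return math.sqrt((p2 * q1) / (q2 * p1))

def se_of_difference(se_a, se_b, r):
    """Standard error of the difference of two correlated quantities."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * r * se_a * se_b)

std_normal = NormalDist()
# s.e. of a quartile of the normal curve, in units of sigma/sqrt(n):
# sqrt(pq) divided by the ordinate at the quartile.
y_quartile = std_normal.pdf(std_normal.inv_cdf(0.25))
se_quartile = math.sqrt(0.25 * 0.75) / y_quartile

r = quantile_error_correlation(0.25, 0.75)
se_semi_iqr = 0.5 * se_of_difference(se_quartile, se_quartile, r)
print(round(r, 4), round(se_semi_iqr, 5))
```

The same two functions serve for any pair of quantiles of a grouped distribution, with the individual standard errors taken from equation (18.5).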

¹ This statement is, perhaps, not obviously true, and the assumption which it represents
is not a necessary condition for the validity of equation (18.6). The alternative approach
of 19.3 avoids using it.

18.29 If there is any failure of the conditions of simple sampling, the
formulae of the preceding sections cease, of course, to hold good. We
need not, however, enter again into a discussion of the effect of removing
the several restrictions, for the effect on the standard error of $p$ was
considered in detail in Chapter 17, and the standard error of any quantile is
directly proportional to the standard error of $p$.

Standard error of the arithmetic mean 

18.30 Let us now determine the standard error of the arithmetic mean.
Suppose we note separately at each drawing the value recorded on the
first, second, third ... and $n$th card of our sample. The standard deviation
of the values on each separate card will tend in the long run to be the
same, and identical with the standard deviation $\sigma$ of $x$ in an indefinitely
large sample, drawn under the same conditions. Further, the value
recorded on each card is (as we assume) uncorrelated with that on every
other. The standard deviation of the sum of the values recorded on the
$n$ cards is therefore $\sqrt{n}\,\sigma$, and the standard deviation of the mean of the
sample is consequently $1/n$th of this; or,

$$\sigma_M = \frac{\sigma}{\sqrt{n}} \tag{18.8}$$

This is a most important and frequently cited formula, and the student
should note that it has been obtained without any reference to the size of
the sample or to the form of the frequency-distribution. It is therefore
of perfectly general application, if $\sigma$ be known. We can verify it against
our formula for the standard deviation of sampling in the case of attributes.
The standard deviation of the number of successes in a sample of $m$
observations is $\sqrt{mpq}$; the standard deviation of the total number of successes
in $n$ samples of $m$ observations each is therefore $\sqrt{nmpq}$: dividing by $n$ we
have the standard deviation of the mean number of successes in the $n$
samples, viz. $\sqrt{mpq}/\sqrt{n}$, agreeing with equation (18.8).
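
That (18.8) holds whatever the form of the distribution can be confirmed exactly for a small case. The sketch below is an illustration only: the population of card values is hypothetical and deliberately skew, and the function names are assumptions. It enumerates every possible sample of $n$ independent draws and compares the exact variance of the sample mean with $\sigma^2/n$.

```python
from fractions import Fraction
from itertools import product

def variance(values, probs):
    """Variance of a discrete distribution given values and probabilities."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def variance_of_sample_mean(values, probs, n):
    """Exact variance of the mean of n independent draws, by enumeration."""
    means, weights = [], []
    for draw in product(range(len(values)), repeat=n):
        weight = Fraction(1)
        for i in draw:
            weight *= probs[i]          # probability of this ordered sample
        means.append(Fraction(sum(values[i] for i in draw), n))
        weights.append(weight)
    return variance(means, weights)

# Hypothetical, deliberately skew population of card values.
values = [0, 1, 5]
probs = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
n = 4
print(variance_of_sample_mean(values, probs, n) == variance(values, probs) / n)
```

Exact fractions are used so that the comparison involves no rounding.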

Example 18.6. In the height distribution considered in Examples 18.4
and 18.5 we found that $\sigma/\sqrt{n} = 0.0277$ approximately. This is then
the standard error of the mean of the distribution.

If we regard the data as a simple sample from the population of men in
the United Kingdom, we may take the mean, i.e. 67.46 inches, as an
estimate of the mean in the population. Three times the standard error
is very small, 0.083 inch, and we can therefore locate the mean in the
population with considerable accuracy.

The standard error in this case, however, gives a misleading idea as
to the accuracy attained in determining the average stature in the United
Kingdom; the sample was not chosen under conditions which gave every
individual an equal chance of being chosen.

Comparison of the standard errors of the median and the mean

18.31 For a normal curve the standard error of the mean is to the
standard error of the median approximately as 100 to 125 (cf. 18.24),
and in general the standard errors of the two stand in a somewhat similar
ratio for a distribution not differing largely from the normal form. For
the distribution of statures used as an illustration in Example 18.4, the
standard error of the median was found to be 0.0349: the standard error
of the mean is only 0.0277. The distribution being very approximately
normal, the ratio of the two standard errors, viz. 1.26, assumes almost
exactly the theoretical magnitude.

As such cases as these seem on the whole to be more common and
typical, we stated in 5.23 that the mean is in general less affected than
the median by errors of sampling. At the same time we also indicated the
exceptional cases in which the median might be the more stable: cases in
which the mean might, for example, be affected considerably by small
groups of widely outlying observations, or in which the frequency-distribution
assumed a form resembling fig. 18.1, but even more exaggerated
as regards the height of the central "peak" and the relative length of
the "tails." Such distributions are not uncommon in some economic
statistics, and they might be expected to characterise some forms of
experimental error. If, in these cases, the greater stability of the median
is sufficiently marked to outweigh its disadvantages in other respects, the
median may be the better form of average to use. Fig. 18.1 represents
a distribution in which the standard errors of the mean and of the median
are the same. Further, in some experimental cases it is conceivable that
the median may be less affected by definite experimental errors, the average
of which does not tend to be zero, than is the mean; this is, of course, a
point quite distinct from that of errors of sampling.

Means of two samples 

18.32 When we have two samples from some record which exhibit 
different means, a very common question which we wish to ask is : Can 
the difference be accounted for by sampling fluctuations, i.e. can the two 
samples have come from the same population ? 

If the two samples are independent and come from the same population
under simple conditions, evidently $\epsilon_{12}$, the standard error of the difference
of their means, is given by

$$\epsilon_{12}^2 = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{18.9}$$

where $n_1$ and $n_2$ are the numbers of observations in the two samples.

If an observed difference exceed three times the value of $\epsilon_{12}$ given by
this formula, it can hardly be ascribed to fluctuations of sampling. If, in
a practical case, the value of $\sigma$ is not known a priori, we must substitute
an observed value, and it would seem natural to take as this value the
standard deviation in the two samples thrown together. If, however, the
standard deviations of the two samples themselves differ more than can
be accounted for on the basis of fluctuations of sampling alone (see below,
19.14), we evidently cannot assume that both samples have been drawn
from the same record: the one sample must have been drawn from a
record or a population exhibiting a greater standard deviation than the
other. If two samples be drawn quite independently from different
populations, indefinitely large samples from which exhibit the standard
deviations $\sigma_1$ and $\sigma_2$, the standard error of the difference of their means
will be given by

$$\epsilon_{12}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \tag{18.10}$$

This is, indeed, the formula usually employed for testing the significance
of the difference between two means in any case; seeing that the standard
error of the mean depends on the standard deviation only, and not on the
mean, of the distribution, we can inquire whether the two populations
from which samples have been drawn differ in mean apart from any difference
in dispersion.
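
A minimal sketch of the test just described follows. The function names and the figures are hypothetical and are not taken from the text; observed standard deviations are substituted for $\sigma_1$ and $\sigma_2$, as is legitimate for large samples.

```python
import math

def se_difference_of_means(sd1, n1, sd2, n2):
    """Standard error of the difference of two independent means, (18.10)."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def exceeds_three_se(mean1, sd1, n1, mean2, sd2, n2):
    """True if the observed difference exceeds three times its standard error."""
    return abs(mean1 - mean2) > 3.0 * se_difference_of_means(sd1, n1, sd2, n2)

# Hypothetical figures, not taken from the text.
se = se_difference_of_means(3.0, 400, 4.0, 900)
print(round(se, 4))
print(exceeds_three_se(10.0, 3.0, 400, 10.5, 4.0, 900))
```

Setting `sd1` equal to `sd2` reduces the first function to (18.9).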


18.33 If two quite independent samples be drawn from the same population,
but instead of comparing the mean of the one with the mean of the
other we compare the mean of the first with the mean of both
samples together, the use of (18.9) or (18.10) is not justified, for errors
in the mean of the one sample are correlated with errors in the mean
of the two together. Following precisely the lines of the similar problem
in 17.29, we find that this correlation is $\sqrt{n_1/(n_1 + n_2)}$; hence

$$\epsilon_{01}^2 = \sigma^2\,\frac{n_2}{n_1(n_1 + n_2)} \tag{18.11}$$


Effect on standard error of mean of breakdown of conditions for simple 
sampling 

18.34 Let us consider briefly the effect on the standard error of the 
mean if the conditions of simple sampling as laid down in 18.16 cease 
to apply. 

If we do not draw from the same record all the time, but first draw a 
series of samples from one record, then another series from another record 
with a somewhat different mean and standard deviation, and so on, or if 
we draw the successive samples from essentially different parts of the same 
record, the standard error will be greatly increased. 

For suppose we draw $k_1$ samples from the first record, for which the
standard deviation (in an indefinitely large sample) is $\sigma_1$, and the mean
differs by $d_1$ from the mean of all the records together (as ascertained by
large samples in numbers proportionate to those now taken), $k_2$ samples
from the second record, for which the standard deviation is $\sigma_2$, and the
mean differs by $d_2$ from the mean of all the records together, and so on.
Then for the samples drawn from the first record the standard error of the
mean will be $\sigma_1/\sqrt{n}$, but the distribution will centre round a value differing
by $d_1$ from the mean for all the records together; and so on for the samples
drawn from the other records. Hence, if $\sigma_M$ be the standard error of the
mean in all the records taken together, $N$ the total number of samples,

$$N\sigma_M^2 = \frac{\Sigma(k\sigma^2)}{n} + \Sigma(kd^2)$$

But the standard deviation $\sigma_0$ for all the records together is given by

$$N\sigma_0^2 = \Sigma(k\sigma^2) + \Sigma(kd^2)$$

Hence, writing $\Sigma(kd^2) = N s_m^2$,

$$\sigma_M^2 = \frac{\sigma_0^2}{n} + \frac{n-1}{n}\,s_m^2 \tag{18.12}$$

This equation corresponds precisely to equation (17.8), page 401. The
standard error of the mean, if our samples are drawn from different records
or from essentially different parts of the entire record, may be increased
indefinitely as compared with the value it would have in the case of
simple sampling. If, for example, we take the statures of samples of
$n$ men in a number of different districts of England, and the standard
deviation of all the statures observed is $\sigma_0$, the standard deviation of the
means for the different districts will not be $\sigma_0/\sqrt{n}$, but will have some
greater value, dependent on the real variation in mean stature from
district to district.
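
The step from the two sums to equation (18.12) can be checked numerically. The sketch below is illustrative only: the function name and the three records are hypothetical. It evaluates $\sigma_M^2$ directly from the first displayed sum and again from (18.12), and also returns the value $\sigma_0^2/n$ that simple sampling would give.

```python
def se_squared_mixed_records(records, n):
    """Square of the s.e. of the mean when samples come from several records.

    records: list of (k, sd, mean), k samples of n being drawn from a record
    with the given standard deviation and mean.
    Returns (direct value, value from (18.12), value under simple sampling).
    """
    N = sum(k for k, _, _ in records)
    grand_mean = sum(k * m for k, _, m in records) / N
    within = sum(k * sd ** 2 for k, sd, _ in records) / N
    s_m2 = sum(k * (m - grand_mean) ** 2 for k, _, m in records) / N
    sigma0_2 = within + s_m2                         # all records together
    direct = within / n + s_m2                       # from the first sum
    by_formula = sigma0_2 / n + (n - 1) / n * s_m2   # (18.12)
    return direct, by_formula, sigma0_2 / n

# Hypothetical records: (number of samples, standard deviation, mean).
records = [(10, 2.0, 66.0), (20, 2.5, 68.0), (10, 3.0, 69.0)]
direct, by_formula, simple = se_squared_mixed_records(records, n=25)
print(direct, by_formula, simple)
```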

18.35 If we are drawing from the same record throughout, but always
draw the first card from one part of that record, the second card from
another part, and so on, and these parts differ more or less, the standard
error of the mean will be decreased. For if, in large samples drawn from
the subsidiary parts of the record from which the several cards are taken,
the standard deviations are $\sigma_1, \sigma_2, \ldots, \sigma_n$, and the means differ by
$d_1, d_2, \ldots, d_n$ from the mean for a large sample from the entire record,
we have

$$\sigma_0^2 = \frac{1}{n}\,\Sigma(\sigma^2) + \frac{1}{n}\,\Sigma(d^2) = \frac{1}{n}\,\Sigma(\sigma^2) + s_m^2$$

Hence,

$$\sigma_M^2 = \frac{1}{n^2}\,\Sigma(\sigma^2) = \frac{\sigma_0^2}{n} - \frac{s_m^2}{n} \tag{18.13}$$

The last equation again corresponds precisely with that given for the
same departure from the rules of simple sampling in the case of attributes
(equation (17.10), page 403). If, to vary our previous illustration, we
had measured the statures of men in each of $n$ different districts, and
then proceeded to form a set of samples by taking one man from each
district for the first sample, one man from each district for the second
sample, and so on, the standard deviation of the means of the samples
so formed would be appreciably less than the standard error of simple
sampling $\sigma_0/\sqrt{n}$. As a limiting case, it is evident that if the men in each
district were all of precisely the same stature, the means of all the samples
so compounded would be identical; in such a case, in fact, $\sigma_0 = s_m$, and
consequently $\sigma_M = 0$. To give another illustration, if the cards from which
we were drawing samples had been arranged in order of the magnitude of
$x$ recorded on each, we would get a much more stable sample by drawing
one card from each successive $n$th part of the record than by taking the
sample according to our previous rules, e.g. shaking them up in a bag
and taking out cards blindfold, or using some equivalent process.

The result is perhaps of some practical interest. It shows that, if we 
are actually taking samples from a large area, different districts of which 
exhibit markedly different means for the variable under consideration, and 
are limited to a sample of n observations, if we break up the whole area 
into n sub-districts, each as homogeneous as possible, and take a contribu- 
tion to the sample from each, we will obtain a more stable mean by this 
orderly procedure than will be given, for the same number of observations, 
by any process of selecting the districts from which samples shall be taken 
by chance. There may, however, be a greater risk of biased error. These
conclusions seem in accord with common sense. We consider this subject 
further in Chapter 23. 
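
The gain from taking one observation from each part of the record, as expressed by equation (18.13), may be illustrated as follows. This is a sketch under stated assumptions: the four parts and the function name are hypothetical, and the parts are taken to be of equal size.

```python
def se_squared_one_per_part(parts):
    """Square of the s.e. of the mean when one card is drawn from each part.

    parts: list of (sd, mean) for the n equal parts of the record.
    Returns (direct value, value from (18.13), value under simple sampling).
    """
    n = len(parts)
    grand_mean = sum(m for _, m in parts) / n
    within = sum(sd ** 2 for sd, _ in parts) / n
    s_m2 = sum((m - grand_mean) ** 2 for _, m in parts) / n
    sigma0_2 = within + s_m2                   # whole record
    direct = within / n                        # (1/n^2) * sum of the variances
    by_formula = sigma0_2 / n - s_m2 / n       # (18.13)
    return direct, by_formula, sigma0_2 / n

# Hypothetical districts: (standard deviation, mean stature).
parts = [(1.0, 64.0), (1.5, 66.0), (1.0, 68.0), (2.0, 70.0)]
direct, by_formula, simple = se_squared_one_per_part(parts)
print(direct, by_formula, simple)
```

The more the parts differ in mean, the larger $s_m^2$ and the greater the reduction below the simple-sampling value.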

18.36 Finally, suppose that, while our conditions (a) and (b) of 18.16
hold good, the magnitude of the variable recorded on one card drawn
is no longer independent of the magnitude recorded on another card,
e.g. that if the first card drawn at any sampling bears a high value, the next
and following cards of the same sample are likely to bear high values also.
In these circumstances, if $r_{12}$ denote the correlation between the values
on the first and second cards, and so on,

$$\sigma_M^2 = \frac{\sigma^2}{n} + \frac{2\sigma^2}{n^2}\,(r_{12} + r_{13} + \ldots + r_{23} + \ldots)$$

There are $n(n-1)/2$ correlations; and if, therefore, $r$ is the arithmetic
mean of them all, we may write

$$\sigma_M^2 = \frac{\sigma^2}{n}\left[1 + r(n-1)\right] \tag{18.14}$$

As the means and standard deviations of $x_1, x_2, \ldots, x_n$ are all identical,
$r$ may more simply be regarded as the correlation coefficient for a table
formed by taking all possible pairs of the $n$ values in every sample. If this
correlation be positive, the standard error of the mean will be increased,
and for a given value of $r$ the increase will be the greater, the greater the
size of the samples. If $r$ be negative, on the other hand, the standard error
will be diminished. Equation (18.14) corresponds precisely to equation
(17.12), page 405.
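
The dependence of equation (18.14) on the size of the sample is easily tabulated. The following sketch is illustrative only; the function name and the chosen values of $r$ and $n$ are assumptions. It prints the ratio of the standard error with a small positive mean correlation to its value under simple sampling.

```python
import math

def se_mean_correlated(sigma, n, r):
    """Standard error of the mean from (18.14), r being the mean correlation."""
    factor = 1.0 + r * (n - 1)
    if factor < 0:
        # A mean correlation below -1/(n - 1) is impossible for n variables.
        raise ValueError("r cannot be less than -1/(n - 1)")
    return sigma * math.sqrt(factor / n)

for n in (10, 100, 1000):
    simple = se_mean_correlated(1.0, n, 0.0)
    print(n, round(se_mean_correlated(1.0, n, 0.05) / simple, 3))
```

Even a mean correlation as small as 0.05 multiplies the standard error several times over in large samples, which is the point made in the text.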

As was pointed out in 17.35, the case when r is positive covers the 
case discussed in 18.34 ; for if we draw successive samples from different 
records, such a positive correlation is at once introduced, although the 
drawings of the several cards at each sampling are quite independent of 
one another. Similarly, the case discussed in 18.35 is covered by the case 
of negative correlation, for if each card is always drawn from a separate 
and distinct part of the record, the correlation between any two will 
on the average be negative ; if some one card be always drawn from a part 
of the record containing low values of the variable, the others must on an 
average be drawn from parts containing relatively high values. It is as 
well, however, to keep the three cases distinct, since a positive or negative 
correlation may arise for reasons quite different from those considered in 
18.34 and 18.35. 


SUMMARY 

1. A knowledge of the sampling distribution of a statistic enables us 
to ascertain the probability that a given sample will exhibit a value of the 
statistic between specified limits. 

2. The sampling distribution of many statistics tends to the normal 
form, or at least a single-humped form, for large values of n, the number in 
the sample, if the sampling is simple. 

3. This fact enables us to take a range of ±3 times the standard error 
as providing limits within which a sample value of the statistic will 
probably lie; with the further assumption of normality of the sampling
distribution we can determine the probability that a sample value will lie 
within any specified limits. 

4. In a large sample the values of statistics in the sample may be 
taken to be estimates of the values in the population, if the sample is 
simple. Further, these values may be used instead of the values in the 
population in calculating the standard errors of the statistics. 

5. The standard error of the median of a normal distribution is given by

$$\text{s.e.} = 1.25331\,\frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation in an indefinitely large sample and $n$
is the number in the sample.

6. With the same notation the standard error of the arithmetic mean is

$$\frac{\sigma}{\sqrt{n}}$$

whatever the form of the distribution.

7. If a series of samples of $n$ is drawn from different populations or from
different parts of a non-homogeneous population,

$$\sigma_M^2 = \frac{\sigma_0^2}{n} + \frac{n-1}{n}\,s_m^2$$

where $\sigma_M$ is the standard error of the mean, $\sigma_0$ is the standard deviation
in all the samples taken together, and $s_m$ is the standard deviation of
means of indefinitely large samples about the mean of all samples.

8. If samples are drawn so that each member comes from a different
section of a non-homogeneous population,

$$\sigma_M^2 = \frac{\sigma_0^2}{n} - \frac{s_m^2}{n}$$

where $\sigma_M$, $\sigma_0$ and $s_m$ are defined as before.

9. If there is a correlation between the results of the drawing of successive
individuals,

$$\sigma_M^2 = \frac{\sigma^2}{n}\left[1 + r(n-1)\right]$$

where $\sigma_M$ is the standard error of the mean, $\sigma$ the standard deviation in
an indefinitely large sample, and $r$ is the mean correlation between the
results of pairs of individuals.


EXERCISES 

18.1 If the sampling distribution of a statistic is normal, find the 
probability that a sample value will differ from the central value by more 
than twice the probable error. 

18.2 In the height distribution of the United Kingdom given in Table
4.7, page 82, assumed to be normal, with mean 67.46 inches and standard
deviation 2.57 inches, find the probability that an individual chosen in
the same way as the members of the distribution will be between 5 and 6
feet in height.

18.3 For the data of the last column of Exercise 4.6, page 100, find the
standard error of the median (154.7 lbs.) and the standard errors of the
two quartiles (142.5 lbs. and 168.4 lbs.).

18.4 For the same distribution find the standard error of the
semi-interquartile range.

18.5 The standard deviation of the same distribution is 21.3 lbs. Find
the standard error of the mean and compare it with the standard error
of the median (Exercise 18.3).

18.6 Taking the values of the median and the quartiles of the marriage 
distribution of Table 4.8, page 84, from Example 7.8, page 100, find their 
standard errors. 

18.7 In the same distribution the mean is 29.4 years and the standard
deviation 8 years, approximately. Find the standard error of the mean
and compare it with that of the median.

18.8 For the same distribution find the standard error of the quartiles,
assuming it to be normal with mean 29.4 years and standard deviation
8 years, and compare your results with those obtained in Exercise 18.6.

18.9 Find the standard error of the 27th percentile of the normal dis- 
tribution. 

18.10 (Imaginary data.) A random sample of 1,000 men from the North
of England shows their mean wage to be £2 7s. per week, with a standard
deviation of £1 8s. A sample of 1,500 men from the South of England
gives a mean wage of £2 9s. per week, with a standard deviation of £2.
Discuss the suggestion that the mean rate of wages varies as between
the two regions.

18.11 Two populations have the same mean but the standard deviation of
one is twice that of the other. Show that in samples of 500 from each
drawn under simple random conditions the difference of the means will in
all probability not exceed $0.3\sigma$, where $\sigma$ is the smaller standard deviation;
and assuming the distribution of the difference of means to be normal,
find the probability that it exceeds half that amount.

18.12 A random sample of 1,000 farms in a certain year gives an average
yield of wheat of 2,000 lbs. per acre, with a standard deviation of 192 lbs.
A random sample of 1,000 farms in the following year gives an average
yield of 2,100 lbs. per acre, with a standard deviation of 224 lbs. Show
that these data are inconsistent with the hypothesis that the average yields
in the country as a whole were the same in the two years.

Would you modify this conclusion if the farms in the second sample
were the same as those in the first?

18.13 Find the mean and median of the U-shaped distribution of Table
4.14, page 96, and compare their standard errors. (For the purpose of
this exercise the median frequency may be found by simple interpolation,
but this gives a value on the high side.)

18.14 The mean of a certain normal distribution is equal to the standard
error of the mean of samples of 100 from that distribution. Find the
probability that the mean of a sample of 25 from the distribution will be
negative.





18.15 If it costs a shilling to draw one member of a sample, how much
would it cost, in sampling from a population with mean 100 and standard
deviation 10, to take sufficient members to ensure that the mean of the
sample in all probability would be within 0.01 per cent of the true value?
Find the extra cost necessary to double the precision.

18.16 Consider the data of Table 4.7, page 82, giving the distribution
of men by height in each of the four countries which then formed part
of the United Kingdom. The means and standard deviations of the four
distributions are given in Exercise 5.1, page 122 and Exercise 6.1, page 148.

What is the standard error of the mean of a sample which consists of
400 men, 100 chosen at random from each of the four countries?

THE SAMPLING OF VARIABLES

LARGE SAMPLES, CONTINUED


The problem 

19.1 We have just considered the standard errors of the most important 
measures of location, the median and the mean, and of certain measures 
of dispersion, the quantiles and the semi-interquartile range. We now 
proceed to discuss the standard errors of other important parameters, 
including the standard deviation, moments and correlation coefficients. 
All that we have said in regard to sampling distributions generally in
18.1 to 18.22 applies equally well to this chapter; and we shall throughout
the following sections be thinking of simple sampling unless we state
explicitly to the contrary.

Standard errors of moments¹

19.2 The data from which we calculate the moments are arranged into
a certain number of groups. Suppose there are $m$ such groups, and
that the expected frequencies falling into them are $y_1, y_2, \ldots, y_m$,
where $y_1 + y_2 + \ldots + y_m = \Sigma(y) = n$, $n$ being the number in the
sample. The expected frequencies are, as shown below, proportional to
the frequencies in the various groups of the parent population.

Let us in the first place recapitulate some of our earlier work by finding
the standard error of one of the frequencies, say $y_s$, due to fluctuations of
sampling.

The probability that an individual chosen from the population falls
into the $s$th group is $y_s/n$. The probability that it does not is $1 - y_s/n$.
For $n$ individuals the distribution of frequencies is given by the binomial

$$\left\{\left(1 - \frac{y_s}{n}\right) + \frac{y_s}{n}\right\}^n$$

with an expected value $y_s$ and a standard deviation $\sqrt{y_s\left(1 - \dfrac{y_s}{n}\right)}$.

¹ The student whose main interest lies in the practical application of the results of
this chapter may prefer to omit paragraphs 19.2 to 19.2.

Now, if the sample is large, we can take the observed frequency in the
$s$th group in calculating the standard error of the frequency of that group.
Taking this observed frequency as our estimate of $y_s$, its standard error,
$\sigma_{y_s}$, is given by

$$\operatorname{var}(y_s) = \sigma_{y_s}^2 = y_s\left(1 - \frac{y_s}{n}\right) \tag{19.1}$$

This, in another form, is our familiar result for the sampling of attributes.

19.3 We may now find the correlation between errors in $y_s$ and errors
in another group-frequency, say $y_t$. It is evident that such a correlation
will exist, for if $y_s$ falls below its expected value, some other frequencies
must be increased.

Consider the variance of $y_s + y_t$. We have, from (14.3), page 327,

$$\operatorname{var}(y_s + y_t) = \operatorname{var} y_s + \operatorname{var} y_t + 2\operatorname{cov}(y_s, y_t) \tag{19.2}$$

Substituting for the variances from (19.1) with the similar expression

$$\operatorname{var}(y_s + y_t) = (y_s + y_t)\left(1 - \frac{y_s + y_t}{n}\right)$$

we find, after a little rearrangement,

$$2\operatorname{cov}(y_s, y_t) = -\frac{2\,y_s y_t}{n}$$

whence

$$\operatorname{cov}(y_s, y_t) = -\frac{y_s y_t}{n} \tag{19.3}$$

This is a more general case of the correlation between quantiles which
we considered in 18.27. For the correlation between $y_s$ and $y_t$ we have,
on dividing (19.3) by the standard deviations,

$$r = -\sqrt{\frac{y_s\,y_t}{(n - y_s)(n - y_t)}}$$
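
Equations (19.1) and (19.3) can be verified exactly for a small sample by enumerating every possible sample. The sketch below is an illustration and not part of the original text; the three-group population and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import product

def frequency_covariance(probs, n, s, t):
    """Exact covariance of the group frequencies y_s and y_t in samples of n."""
    e_s = e_t = e_st = Fraction(0)
    for draw in product(range(len(probs)), repeat=n):
        weight = Fraction(1)
        for g in draw:
            weight *= probs[g]                   # probability of this sample
        ys, yt = draw.count(s), draw.count(t)    # observed group frequencies
        e_s += weight * ys
        e_t += weight * yt
        e_st += weight * ys * yt
    return e_st - e_s * e_t

# Hypothetical population with three groups.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 5
expected = [n * p for p in probs]                # expected frequencies y_s
cov = frequency_covariance(probs, n, 0, 1)
print(cov == -expected[0] * expected[1] / n)     # (19.3)
```

Taking `s` equal to `t` returns the variance of a single frequency, which may be compared with (19.1) in the same way.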



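19.4 By definition the $q$th moment about an arbitrary point is $\mu'_q$, where

$$n\mu'_q = \Sigma(x_s^q\,y_s)$$

$x$ being the variate measured from the arbitrary point. We write a
deviation in a quantity $\mu'_q$ or $y_s$ as $\delta\mu'_q$ or $\delta y_s$, as the case may be. (The
symbol $\delta$ is not to be regarded as a number multiplying $\mu'_q$ or $y_s$ but
as part of the single quantity $\delta\mu'_q$ or $\delta y_s$.) Then

$$n\,\delta\mu'_q = \Sigma(x_s^q\,\delta y_s)$$

Squaring both sides,

$$n^2(\delta\mu'_q)^2 = (x_1^q\,\delta y_1 + x_2^q\,\delta y_2 + \ldots + x_m^q\,\delta y_m)^2 = \Sigma\{x_s^{2q}(\delta y_s)^2\} + \Sigma'(x_s^q x_t^q\,\delta y_s\,\delta y_t)$$

where $\Sigma'$ denotes summation over all values of $s$ and $t$ except those for
which $s = t$.

This equation holds for any one sample, and we have to sum it for all
samples. Carrying out this summation first (in which $s$ and $t$ are fixed),
and substituting from equations (19.1) and (19.3) on the right-hand side,
we have

$$n^2 \operatorname{var}\mu'_q = \Sigma\left\{x_s^{2q}\,y_s\left(1 - \frac{y_s}{n}\right)\right\} - \Sigma'\left(x_s^q x_t^q\,\frac{y_s y_t}{n}\right) = \Sigma(x_s^{2q}\,y_s) - \frac{1}{n}\,\Sigma(x_s^q\,y_s)\,\Sigma(x_t^q\,y_t)$$

Hence,

$$\sigma_{\mu'_q} = \sqrt{\operatorname{var}\mu'_q} = \sqrt{\frac{\mu'_{2q} - \mu'^{\,2}_q}{n}} \tag{19.4}$$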
Example 19.1. Let us find the standard error of the first moment,
or mean $h$.

We have, from (19.4),

$$\operatorname{var} h = \frac{\mu'_2 - h^2}{n}$$

Now $\mu'_2 - h^2$ is the second moment $\mu_2$ about the mean, i.e. is $\sigma^2$.
Hence,

$$\sigma_h = \frac{\sigma}{\sqrt{n}}$$

which is the result we have already found in 18.30.
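
For grouped data, equation (19.4) translates directly into code. The following is a minimal sketch with hypothetical class mid-points and frequencies; the function names are assumptions. Observed moments are substituted for the population values, as is permissible in large samples, and the case $q = 1$ is checked against $\sigma/\sqrt{n}$ as in Example 19.1.

```python
import math

def raw_moment(xs, ys, q):
    """q-th moment of grouped data about the origin of xs."""
    n = sum(ys)
    return sum(x ** q * y for x, y in zip(xs, ys)) / n

def se_raw_moment(xs, ys, q):
    """Standard error of the q-th moment about a fixed point, (19.4)."""
    n = sum(ys)
    m_q, m_2q = raw_moment(xs, ys, q), raw_moment(xs, ys, 2 * q)
    return math.sqrt((m_2q - m_q ** 2) / n)

# Hypothetical grouped data: class mid-points and frequencies.
xs = [-2, -1, 0, 1, 2]
ys = [5, 20, 50, 20, 5]
n = sum(ys)
sigma = math.sqrt(raw_moment(xs, ys, 2) - raw_moment(xs, ys, 1) ** 2)
print(math.isclose(se_raw_moment(xs, ys, 1), sigma / math.sqrt(n)))
print(se_raw_moment(xs, ys, 2))
```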


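Correlation between errors in the qth and rth moments, both about the
same fixed point

19.5 As in 19.4 we have

$$n\,\delta\mu'_q = \Sigma(x_s^q\,\delta y_s)$$
$$n\,\delta\mu'_r = \Sigma(x_s^r\,\delta y_s)$$

Multiplying,

$$n^2\,\delta\mu'_q\,\delta\mu'_r = \Sigma\{x_s^{q+r}(\delta y_s)^2\} + \Sigma'(x_s^q x_t^r\,\delta y_s\,\delta y_t)$$

and summing for all samples,

$$n^2 \operatorname{cov}(\mu'_q, \mu'_r) = \Sigma\{x_s^{q+r} \operatorname{var}(y_s)\} + \Sigma'\{x_s^q x_t^r \operatorname{cov}(y_s, y_t)\}$$

On substitution for $\operatorname{var} y_s$ and $\operatorname{cov}(y_s, y_t)$ from (19.1) and (19.3), the
right-hand side reduces to $n(\mu'_{q+r} - \mu'_q\mu'_r)$, and hence,

$$\operatorname{cov}(\mu'_q, \mu'_r) = \frac{\mu'_{q+r} - \mu'_q\,\mu'_r}{n} \tag{19.5}$$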

Standard error of the moments about the mean 

19.6 In 19.4 and 19.5 we have considered moments about a fixed point. 
In practice we have to deal more usually with moments about the mean 
of the sample. Since this mean is itself subject to sampling fluctuations, 
the standard errors of moments about the mean will not in general be the 
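same as those about a fixed point.

If h is the mean we have, by definition,

$$n\mu_q = \sum\{(x_s - h)^q y_s\}$$
$$= \sum(x_s^q y_s) - qh\sum(x_s^{q-1} y_s) + T$$

where T is written generally for an expression involving $h^2$ and higher
powers of h.

Now let h vary to $h + \delta h$, $y_s$ vary to $y_s + \delta y_s$, and $\mu_q$ vary to $\mu_q + \delta\mu_q$.
We have:

$$n(\mu_q + \delta\mu_q) = \sum\{(x_s - h - \delta h)^q (y_s + \delta y_s)\}$$

Expanding as before and subtracting the equation for $\mu_q$,

$$n\,\delta\mu_q = \sum(x_s^q\,\delta y_s) - q\,\delta h\sum(x_s^{q-1} y_s) - q\sum(x_s^{q-1}\,\delta h\,\delta y_s) + U$$
$$= n\,\delta\mu'_q - nq\,\delta h\,\mu'_{q-1} + U$$

where U will involve h and higher powers. We may neglect the term in
$\delta h\,\delta y_s$ as being small compared with the remaining terms. Squaring
and summing for all samples,

$$\operatorname{var}\mu_q = \operatorname{var}\mu'_q + q^2\mu_{q-1}'^{2}\operatorname{var} h - 2q\mu'_{q-1}\operatorname{cov}(h, \mu'_q) + U$$

Substituting for $\operatorname{var}\mu'_q$ etc. from (19.4) and (19.5),

$$\operatorname{var}\mu_q = \frac{1}{n}\{\mu'_{2q} - \mu_q'^{2} + q^2\mu_{q-1}'^{2}(\mu'_2 - h^2) - 2q\mu'_{q-1}(\mu'_{q+1} - h\mu'_q)\} + U$$

Now put h = 0. U vanishes and the moments become moments about
the mean and may therefore be written without dashes. Hence,

$$\operatorname{var}\mu_q = \frac{1}{n}\{\mu_{2q} - \mu_q^2 + q^2\mu_2\mu_{q-1}^2 - 2q\mu_{q-1}\mu_{q+1}\} \qquad (19.6)$$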
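The general result (19.6), namely var μ_q = {μ_2q - μ_q² + q²μ_2μ²_(q-1) - 2qμ_(q-1)μ_(q+1)}/n, lends itself to a small routine. The sketch below is illustrative only: the function name `var_central_moment` and the convention of passing the population moments as a dictionary keyed by order are assumptions made here, not part of the text. As a check it is applied to a normal parent with unit standard deviation, for which μ_2 = 1, μ_4 = 3 and the odd moments vanish.

```python
def var_central_moment(q, mu, n):
    """Large-sample variance of the q-th moment about the mean, eq. (19.6).

    mu maps an order k to the population central moment mu_k; the orders
    q-1, q, q+1 and 2q must be present (mu[1] is 0 by definition).
    """
    return (mu[2 * q] - mu[q] ** 2
            + q ** 2 * mu[2] * mu[q - 1] ** 2
            - 2 * q * mu[q - 1] * mu[q + 1]) / n


# Normal parent with sigma = 1: mu_1 = mu_3 = 0, mu_2 = 1, mu_4 = 3.
normal = {1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}
print(var_central_moment(2, normal, n=100))  # 0.02, i.e. 2*sigma^4/n
```

The dictionary form keeps the routine independent of how the moments were obtained; in practice the sample moments would be substituted for the unknown population values.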


Correlation between two moments both measured about the mean 
19.7 In a similar way it may be shown that

$$\operatorname{cov}(\mu_q, \mu_r) = \frac{1}{n}\{\mu_{q+r} - \mu_q\mu_r + qr\mu_2\mu_{q-1}\mu_{r-1} - r\mu_{r-1}\mu_{q+1} - q\mu_{q-1}\mu_{r+1}\} \qquad (19.7)$$

We omit the algebra for the sake of brevity.

Correlation between errors in a moment about a fixed point and in a
moment about the mean

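19.8 Let us first of all find the correlation between deviations in a
group-frequency $y_t$ and the moment $\mu'_q$ about a fixed point. We have:

$$n\,\delta\mu'_q = \sum(x_s^q\,\delta y_s)$$

Hence,

$$n\,\delta\mu'_q\,\delta y_t = x_t^q(\delta y_t)^2 + \sum{}'(x_s^q\,\delta y_s\,\delta y_t)$$

the summation $\sum'$ being taken over all values of s except s = t.
Hence, summing for all samples,

$$n \operatorname{cov}(\mu'_q, y_t) = x_t^q y_t(1 - y_t/n) - \frac{y_t}{n}\sum{}'(x_s^q y_s)$$
$$= x_t^q y_t - \frac{y_t}{n}\sum(x_s^q y_s)$$
$$= y_t\{x_t^q - \mu'_q\}$$

Hence,

$$\operatorname{cov}(\mu'_q, y_t) = \frac{y_t}{n}(x_t^q - \mu'_q) \qquad (19.8)$$

Similarly, for the product-sum of deviations in $y_t$ and the moment $\mu_q$
about the mean, we have:

$$\operatorname{cov}(\mu_q, y_t) = \frac{y_t}{n}(x_t^q - \mu'_q) - \frac{q y_t}{n}(x_t - h)\mu'_{q-1} + \text{terms in } h \text{ and higher powers}$$

Putting h = 0, the right-hand side reduces to

$$\frac{y_t}{n}(x_t^q - \mu_q - q x_t\mu_{q-1}) \qquad (19.9)$$

For the product-sum of errors in $\mu'_q$ and $\mu_r$,

$$\delta\mu_r = \delta\mu'_r - r\,\delta h\,\mu'_{r-1} + U$$

where U, as before, denotes an expression involving h and higher powers.
Hence,

$$n\,\delta\mu'_q\,\delta\mu_r = \sum(x_s^q\,\delta y_s\,\delta\mu'_r) - \sum(x_s^q\,\delta y_s\,r\,\delta h\,\mu'_{r-1}) + U$$

Summing for all deviations,

$$\operatorname{cov}(\mu'_q, \mu_r) = \frac{1}{n}\sum\{x_s^q \operatorname{cov}(y_s, \mu'_r)\} - \frac{1}{n}\sum\{x_s^q\,r\mu'_{r-1}\operatorname{cov}(h, y_s)\} + U$$

and substituting from (19.8) and (19.9) the right-hand side becomes

$$\frac{\mu'_{q+r} - \mu'_q\mu'_r}{n} - r\mu'_{r-1}\,\frac{\mu'_{q+1} - h\mu'_q}{n} + U$$

Put h = 0. Then,

$$\operatorname{cov}(\mu'_q, \mu_r) = \frac{1}{n}(\mu_{q+r} - \mu_q\mu_r - r\mu_{r-1}\mu_{q+1}) \qquad (19.10)$$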

Use of Sheppard’s corrections in evaluating standard errors. 

19.9 Theoretically, Sheppard's corrections for grouping are not to be 
used in evaluating the moments which enter into the general equations for 
standard errors obtained in the previous sections. For, as the corrected 
values differ from the uncorrected values only by constants depending on 
the width of the interval, the sampling deviations of corrected and
uncorrected moments are equal, and hence so are their standard errors. But
the standard errors of uncorrected moments are given by the equations we
have obtained in the foregoing sections, and hence those equations are
applicable to corrected moments provided that the uncorrected values are 
used in them. 

In practice, however, it seems to make very little difference which 
moments we use, unless the sample is very large indeed. But as the 
uncorrected values have to be obtained before the corrected values can be 
calculated, and are therefore usually available, it is as well to use the 
uncorrected values wherever possible. 

Standard error of the variance 

19.10 Armed with the general results of the foregoing sections, we can
discuss the standard errors of a large class of parameters.

From equation (19.6), putting q = 2, we have, since $\mu_1 = 0$,

$$\sqrt{\operatorname{var}\mu_2} = \sigma_{\mu_2} = \sqrt{\frac{\mu_4 - \mu_2^2}{n}} \qquad (19.11)$$

which gives the standard error of the variance $\mu_2$.

If the parent population is normal,

$$\mu_2 = \sigma^2, \quad \mu_4 = 3\sigma^4 \qquad (8.23)$$

and hence,

$$\sigma_{\mu_2} = \sigma^2\sqrt{\frac{2}{n}} \qquad (19.12)$$

Standard error of the standard deviation 

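19.11 If $\mu_2$ is the variance, we have:

$$\mu_2 = \sigma^2$$

Hence,

$$\mu_2 + \delta\mu_2 = (\sigma + \delta\sigma)^2 = \sigma^2 + 2\sigma\,\delta\sigma + (\delta\sigma)^2$$

Neglecting $(\delta\sigma)^2$ in comparison with $\delta\sigma$,

$$\delta\mu_2 = 2\sigma\,\delta\sigma$$

Squaring and summing for all samples,

$$\operatorname{var}\mu_2 = \sigma_{\mu_2}^2 = 4\sigma^2 \operatorname{var}\sigma$$

Hence,

$$\sqrt{\operatorname{var}\sigma} = \sigma_\sigma = \sqrt{\frac{\operatorname{var}\mu_2}{4\mu_2}} = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 n}} \qquad (19.13)$$

If the parent distribution is normal this reduces to

$$\sqrt{\operatorname{var}\sigma} = \sigma_\sigma = \frac{\sigma}{\sqrt{2n}} \qquad (19.14)$$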


19.12 The form of equation (19.14) has been widely used for the standard
error of $\sigma$ without due regard to the nature of the parent population,
and the student should guard against this mistake.

We have, in fact, from (19.13):

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}}\sqrt{1 + \frac{\beta_2 - 3}{2}}$$

How far $\sigma_\sigma$ can be taken to be the value (19.14) therefore depends on
how close the factor $\sqrt{1 + \frac{\beta_2 - 3}{2}}$ is to unity, i.e. depends on the kurtosis
of the parent distribution.

The following table shows the value of this factor for various values
of $\beta_2$:

    β_2     Factor
     2      0.7071
     3      1.0000
     4      1.2247
     5      1.4142
     6      1.5811
     7      1.7321
     8      1.8708
     9      2.0000

It thus appears that if the population is leptokurtic the real standard
error is greater than that given by the assumption of normality, and may
be twice as great or even more. If the population is platykurtic the real
standard error is less than the normal value.

If $\beta_2 - 3$ is small, the factor is approximately $1 + \frac{\beta_2 - 3}{4}$.
This differs from unity by more than 5 per cent if $\beta_2$ is less than 2.8 or
more than 3.2. Hence, values of $\beta_2$ lying outside the range 2.8 to 3.2 (and
they are more common than not in practice) will give an error of more than
5 per cent if the population is assumed to be normal.
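The factor discussed in 19.12 can be tabulated directly. Writing μ_4 = β_2 μ_2² in (19.13) gives the ratio of the general standard error of σ to the normal value (19.14) as √(1 + (β_2 - 3)/2) = √((β_2 - 1)/2). The following is a minimal sketch; the function name `kurtosis_factor` is chosen here for illustration.

```python
import math


def kurtosis_factor(beta2):
    """Ratio of the standard error of sigma from (19.13) to the normal value (19.14)."""
    return math.sqrt((beta2 - 1.0) / 2.0)


# Reproduce the table for beta_2 = 2, 3, ..., 9.
for beta2 in range(2, 10):
    print(beta2, f"{kurtosis_factor(beta2):.4f}")
```

For β_2 = 3 the factor is unity, and for β_2 = 9 it is 2, in agreement with the tabulated values.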

Example 19.2. — For the height distribution of Table 4.7, page 82, we 
have found that $\sigma$ = 2.57 inches, n = 8585. The population may be taken
to be normal, for $\beta_2$ from the sample is 3.149 (Example 7.9, page 164) and
hence the standard error of $\sigma$ is

$$\frac{2.57}{\sqrt{2 \times 8585}} = 0.02 \text{ approximately.}$$

Hence, we may say that the s.d. in the population almost certainly lies
in the range 2.57 ± 0.06, assuming that the sampling is simple.

Example 19.3. — The distribution of Australian marriages of Table 4.8, 
page 84, has uncorrected moments $\mu_2$ and $\mu_4$, in class-intervals, as follows:

    μ_2 = 7.0570
    μ_4 = 408.7382    (Example 7.2, page 157.)

Hence,

$$\sigma = \sqrt{\mu_2} = 2.6565$$

The standard error of $\sigma$ is

$$\sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 n}} = \sqrt{\frac{408.7382 - (7.0570)^2}{4 \times 7.0570 \times 301{,}785}} = 0.00649 \text{ class-intervals}$$

As we should expect from such a large sample, the standard error is
very small, and we conclude that the standard deviation of the parent
lies in the range 2.6565 ± 0.0195.

It may be pointed out that if we take these data as a sample of 
Australian marriages in general, we may be violating the conditions of 
simple sampling, for the distribution most likely changes from year to 
year. 

Example 19.4. — In the previous example we worked throughout with 
uncorrected values. The corrected moments (Example 7.4, page 159) 
are:

    μ_2 = 6.9736
    μ_4 = 405.2389

We then have, for the corrected value of $\sigma$,

$$\sigma = \sqrt{6.9736} = 2.641$$

But the standard error of $\sigma$ is 0.00649 as in the previous example, for we
must use the uncorrected values in calculating it.

As a matter of fact, if we had used the corrected values we should
have found the value 0.00654, a practically negligible difference even for a
sample of this size.

Finally, let us compare this value with that given by the assumption
of normality. We have:

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}} = \frac{2.6565}{\sqrt{603{,}570}} = 0.00342 \text{ class-intervals}$$

i.e. only about half the true value. This is in accordance with the result
of Example 7.6, for $\beta_2$ is over 8.
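The arithmetic of Examples 19.3 and 19.4 can be retraced as follows. This is a sketch under the assumption that the uncorrected moments 7.0570 and 408.7382 and the sample size 301,785 are as quoted above; the function names are illustrative.

```python
import math


def se_sigma(mu2, mu4, n):
    """Standard error of the standard deviation, eq. (19.13)."""
    return math.sqrt((mu4 - mu2 ** 2) / (4.0 * mu2 * n))


def se_sigma_normal(sigma, n):
    """Standard error of the standard deviation for a normal parent, eq. (19.14)."""
    return sigma / math.sqrt(2.0 * n)


mu2, mu4, n = 7.0570, 408.7382, 301785   # uncorrected moments, Example 19.3
general = se_sigma(mu2, mu4, n)
normal = se_sigma_normal(math.sqrt(mu2), n)
print(f"{general:.5f} {normal:.5f} {general / normal:.2f}")  # 0.00649 0.00342 1.90
```

The ratio of the two values is the kurtosis factor √((β_2 - 1)/2) for this distribution, which is why the normal formula gives only about half the true standard error.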


Comparative effects of sampling fluctuations and corrections for grouping 
19.13 Writing temporarily $\sigma_1^2$ for the uncorrected value of the variance
and $\sigma_2^2$ for the corrected value, we have, h here denoting the width of
the class-interval:

$$\sigma_2^2 = \sigma_1^2 - \frac{h^2}{12}$$

$$\therefore \frac{\sigma_1^2}{\sigma_2^2} = 1 + \frac{1}{12}\,\frac{h^2}{\sigma_2^2}$$

If the class-interval is chosen so as to make the number of intervals d,
then $6\sigma_2$ would be about $dh$ and $h/\sigma_2$ about $6/d$. Hence

$$\frac{\sigma_1}{\sigma_2} = \left(1 + \frac{3}{d^2}\right)^{1/2}$$

or, since $3/d^2$ is small,

$$\frac{\sigma_1}{\sigma_2} = 1 + \frac{3}{2d^2}$$

For instance, if d is 20, the corrected value is about 0.375 per cent less
than the uncorrected value.

Now, for a normal population,

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}}$$

and if n is, say, 1,000, the standard error is $\sigma/\sqrt{2000} = 0.0224\sigma$ = 2.24 per cent
of $\sigma$. Thus Sheppard's correction amounts to no more than about one-
sixth of the standard error, and to make it gives an almost misleading
idea of precision in most practical cases.

It was for this reason that we recommended (6.12 and 9.29) that the
Sheppard corrections should not be applied if the total frequency is less
than 1,000. On the other hand, in Examples 19.3 and 19.4 the correction
is large compared with the standard error and can reasonably be made,
owing to the largeness of the sample.
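The comparison of 19.13 can be repeated for other numbers of class-intervals and other sample sizes. The sketch below assumes, as in the text, a normal parent whose range of about six standard deviations is divided into d intervals, so that the proportional effect of Sheppard's correction on σ is about 3/(2d²); the function names are illustrative.

```python
import math


def sheppard_relative_change(d):
    """Approximate proportional fall in sigma after Sheppard's correction,
    assuming a range of about 6 sigma divided into d class-intervals."""
    return 3.0 / (2.0 * d ** 2)


def relative_se_sigma_normal(n):
    """Standard error of sigma as a proportion of sigma, normal parent."""
    return 1.0 / math.sqrt(2.0 * n)


d, n = 20, 1000
corr = sheppard_relative_change(d)
se = relative_se_sigma_normal(n)
print(f"{100 * corr:.3f}% {100 * se:.2f}% ratio {corr / se:.2f}")  # 0.375% 2.24% ratio 0.17
```

Since the correction does not depend on n while the standard error falls as 1/√n, the ratio grows with the sample size, which is the reason the correction becomes worth making only for large samples.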




Comparison of standard deviations of two samples 
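19.14 As in 18.32, where we considered the comparison of the means
of two samples, if the samples are independent and come from the same
population the standard error of the difference of their standard deviations
is given by

$$\epsilon_{12}^2 = \frac{\mu_4 - \mu_2^2}{4\mu_2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (19.15)$$

where $n_1$, $n_2$ are the numbers in the samples, or, if the population be
normal,

$$\epsilon_{12}^2 = \frac{\sigma^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (19.16)$$

If the two samples are drawn from different populations with constants
$\mu_2$, $\mu_4$ and $\nu_2$, $\nu_4$, the standard error of the difference of the standard
deviations is given by

$$\epsilon_{12}^2 = \frac{\mu_4 - \mu_2^2}{4\mu_2 n_1} + \frac{\nu_4 - \nu_2^2}{4\nu_2 n_2} \qquad (19.17)$$

or

$$\epsilon_{12}^2 = \frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2} \qquad (19.18)$$

if the population be normal.

Again, if the standard deviation of one sample is compared with the
standard deviation of the two samples when pooled, the standard error of
the difference is, if the distribution be normal,

$$\epsilon^2 = \frac{\sigma^2 n_2}{2n_1(n_1 + n_2)} \qquad (19.19)$$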
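As an illustration of how the normal-theory form (19.18) might be applied, the sketch below takes the squared standard error of the difference of two standard deviations as σ_1²/(2n_1) + σ_2²/(2n_2), with the sample values substituted for the unknown population values. The figures are hypothetical, chosen only to show the calculation.

```python
import math


def se_diff_sd_normal(s1, n1, s2, n2):
    """Standard error of the difference of two standard deviations, eq. (19.18)."""
    return math.sqrt(s1 ** 2 / (2.0 * n1) + s2 ** 2 / (2.0 * n2))


# Hypothetical figures, for illustration only.
s1, n1 = 3.0, 500
s2, n2 = 3.3, 400
se = se_diff_sd_normal(s1, n1, s2, n2)
ratio = abs(s1 - s2) / se
print(f"se = {se:.3f}, difference / se = {ratio:.2f}")
```

A difference of about twice its standard error, as here, would be regarded as suggestive but not conclusive on the usual three-standard-error criterion.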


These results can be used to test the significance of differences between
standard deviations precisely as the equations of 18.32 and 18.33 were
used to test the significance of differences between means.

Standard error of third and fourth moments about the mean

19.15 From equation (19.6), putting q = 3,

$$\sigma_{\mu_3} = \sqrt{\frac{\mu_6 - \mu_3^2 - 6\mu_2\mu_4 + 9\mu_2^3}{n}} \qquad (19.20)$$

If the distribution is normal,

$$\mu_6 = 15\sigma^6, \quad \mu_4 = 3\sigma^4, \quad \mu_3 = 0.$$

Hence,

$$\sigma_{\mu_3} = \sigma^3\sqrt{\frac{6}{n}} \qquad (19.21)$$

Similarly, from equation (19.6), putting q = 4,

$$\sigma_{\mu_4} = \sqrt{\frac{\mu_8 - \mu_4^2 - 8\mu_3\mu_5 + 16\mu_2\mu_3^2}{n}} \qquad (19.22)$$

If the distribution is normal, $\mu_8 = 105\sigma^8$, $\mu_5 = 0$.
Hence,

$$\sigma_{\mu_4} = \sigma^4\sqrt{\frac{105 - 9}{n}} = \sigma^4\sqrt{\frac{96}{n}} \qquad (19.23)$$

Example 19.5. — For the height distribution of Table 4.7 we have 
(Example 7.1, page 153) — 


    μ_2 (uncorrected) = 6.6168
    μ_3 (uncorrected) = -0.2078
    μ_4 (uncorrected) = 137.6892

and from Example 7.3, page 159:

    μ_2 (corrected) = 6.5335
    μ_3 (corrected) = -0.2078
    μ_4 (corrected) = 134.4100

We did not calculate higher moments, and hence cannot use equations
(19.20) and (19.22) with these data. The distribution is, however,
approximately normal. Hence, from (19.21),

$$\sigma_{\mu_3} = \sigma^3\sqrt{\frac{6}{n}} = 0.45 \text{ approximately}$$

The value of $\mu_3$ cannot therefore be judged significantly different from
zero, which is what we should expect, for we have assumed the population
to be normal.

From (19.23) we have:

$$\sigma_{\mu_4} = \sigma^4\sqrt{\frac{96}{n}} = 4.63 \text{ approximately}$$

These are calculated from the uncorrected value of $\sigma$. We may infer
that $\mu_4$ (corrected) lies within the range 134.41 ± 13.89. The Sheppard
correction is only 3.28, and is submerged in the possible sampling deviation,
even for a sample of 8585. What we have said in 19.13 applies, in fact,
a fortiori to the higher moments.
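The two standard errors of Example 19.5 follow from the normal-theory forms σ³√(6/n) and σ⁴√(96/n) of (19.21) and (19.23). A short sketch, assuming the uncorrected variance 6.6168 and n = 8585 quoted in the example (the function names are illustrative):

```python
import math


def se_mu3_normal(sigma2, n):
    """Standard error of the third moment for a normal parent, eq. (19.21)."""
    return sigma2 ** 1.5 * math.sqrt(6.0 / n)


def se_mu4_normal(sigma2, n):
    """Standard error of the fourth moment for a normal parent, eq. (19.23)."""
    return sigma2 ** 2 * math.sqrt(96.0 / n)


sigma2, n = 6.6168, 8585   # uncorrected variance and sample size, Example 19.5
print(f"{se_mu3_normal(sigma2, n):.2f} {se_mu4_normal(sigma2, n):.2f}")  # 0.45 4.63
```

The observed third moment, -0.2078, is less than half its standard error, and three times the standard error of the fourth moment gives the ±13.89 of the example.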

19.16 It will be evident that the standard errors of moments of high
order are very large; for the moments increase rapidly, and the standard
error of the moment of order q depends on the moment of order 2q. For
example, in the normal distribution, for q = 6, $\mu_{2q} = 10{,}395\sigma^{12}$ and $\sigma_{\mu_6}$ will
be of the order $\frac{100\sigma^6}{\sqrt{n}}$, whereas $\mu_6 = 15\sigma^6$. Unless, therefore, n is at least
400, the range $3\sigma_{\mu_6}$ will be greater than the value of $\mu_6$ and hence we
cannot locate the value of $\mu_6$ in the population with any exactness. Our
approximations, in fact, break down if the deviations are large.

The large sampling errors of moments of high orders prevent the use 
of moments higher than the fourth in most practical problems. 
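The growth of the standard error with the order of the moment can be seen numerically. The sketch below assumes a normal parent with unit standard deviation, for which the central moment of even order k is 1 × 3 × 5 × ... × (k - 1); for even q the odd moments in (19.6) vanish and the variance of the qth moment reduces to (μ_2q - μ_q²)/n. The function names are illustrative.

```python
import math


def normal_even_moment(k):
    """Central moment of even order k for a normal parent with sigma = 1."""
    return math.prod(range(1, k, 2))


def relative_se_even_moment(q, n):
    """Standard error of mu_q divided by mu_q, for even q and a normal parent."""
    mu_q = normal_even_moment(q)
    mu_2q = normal_even_moment(2 * q)
    return math.sqrt((mu_2q - mu_q ** 2) / n) / mu_q


for q in (2, 4, 6):
    print(q, normal_even_moment(2 * q), round(relative_se_even_moment(q, 400), 3))
```

With n = 400 the relative standard error of the sixth moment comes out at about one-third, so that three times the standard error is roughly equal to the moment itself, as stated in 19.16.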

Correlation between errors in mean and standard deviation 

19.17 From equation (19.10), putting q = 1, r = 2, and remembering that
$\mu_1 = 0$, we have:

$$\operatorname{cov}(h, \mu_2) = \frac{\mu_3}{n}$$

Hence, if $\mu_3 = 0$, errors in the mean and variance, and hence in the
mean and s.d., are uncorrelated. In particular, we have the important 
result that errors in the mean and s.d. in a normal population are un- 
correlated. In actual fact they are independent, even for small samples, 
but we shall have to state this result without proof. 

Standard error of the coefficient of variation 

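19.18 The coefficient of variation V is defined as

$$V = \frac{100\sigma}{h} = \frac{100\sqrt{\mu_2}}{h}$$

Hence,

$$V + \delta V = \frac{100\sqrt{\mu_2 + \delta\mu_2}}{h + \delta h} = \frac{100\sqrt{\mu_2}}{h}\left(1 + \frac{\delta\mu_2}{\mu_2}\right)^{1/2}\left(1 + \frac{\delta h}{h}\right)^{-1}$$

Neglecting quantities small compared with $\delta\mu_2$ and $\delta h$, this becomes

$$V + \delta V = V\left(1 + \frac{\delta\mu_2}{2\mu_2} - \frac{\delta h}{h}\right)$$

Hence,

$$\frac{\delta V}{V} = \frac{\delta\mu_2}{2\mu_2} - \frac{\delta h}{h}$$

$$\frac{(\delta V)^2}{V^2} = \frac{(\delta\mu_2)^2}{4\mu_2^2} + \frac{(\delta h)^2}{h^2} - \frac{(\delta\mu_2)(\delta h)}{\mu_2 h}$$

and summing for all samples we have:

$$\frac{\sigma_V^2}{V^2} = \frac{\operatorname{var}\mu_2}{4\mu_2^2} + \frac{\operatorname{var} h}{h^2} - \frac{\operatorname{cov}(\mu_2, h)}{\mu_2 h}$$

If the distribution is normal,

$$\operatorname{var}\mu_2 = \frac{2\sigma^4}{n}, \quad \operatorname{var} h = \frac{\sigma^2}{n}$$

and $\operatorname{cov}(\mu_2, h) = 0$ (19.17).
Hence,

$$\frac{\sigma_V^2}{V^2} = \frac{1}{2n} + \frac{\sigma^2}{nh^2} = \frac{1}{2n}\left\{1 + \frac{2V^2}{10^4}\right\}$$

Hence,

$$\sigma_V = \frac{V}{\sqrt{2n}}\left\{1 + \frac{2V^2}{10^4}\right\}^{1/2} \qquad (19.24)$$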
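The size of the bracketed term in (19.24), which for a normal parent gives the standard error of V as (V/√(2n)){1 + 2V²/10⁴}^(1/2) with V in per cent, is easily examined numerically. The sketch below uses a hypothetical sample, and the function name `se_cv_normal` is an illustrative choice.

```python
import math


def se_cv_normal(v, n):
    """Standard error of the coefficient of variation V (in per cent), eq. (19.24)."""
    return v / math.sqrt(2.0 * n) * math.sqrt(1.0 + 2.0 * v ** 2 / 1.0e4)


# Hypothetical sample: V = 10 per cent from n = 200 observations.
v, n = 10.0, 200
print(f"{se_cv_normal(v, n):.4f} {v / math.sqrt(2.0 * n):.4f}")  # 0.5050 0.5000
```

For a coefficient of variation of 10 per cent the second factor raises the standard error by only about 1 per cent.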


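In many practical cases the second term differs little from unity and

$$\sigma_V = \frac{V}{\sqrt{2n}}$$

will give a sufficiently precise result.

Standard error of $\beta_1$ and $\beta_2$

19.19 The standard errors of $\beta_1$ and $\beta_2$ can be deduced in a similar
manner.

In fact,

$$\beta_1 + \delta\beta_1 = \frac{(\mu_3 + \delta\mu_3)^2}{(\mu_2 + \delta\mu_2)^3}$$

which, after some reduction, gives

$$\frac{\delta\beta_1}{\beta_1} = \frac{2\,\delta\mu_3}{\mu_3} - \frac{3\,\delta\mu_2}{\mu_2}$$

Squaring and summing for all samples:

$$\frac{\operatorname{var}\beta_1}{\beta_1^2} = \frac{4\operatorname{var}\mu_3}{\mu_3^2} + \frac{9\operatorname{var}\mu_2}{\mu_2^2} - \frac{12}{\mu_2\mu_3}\operatorname{cov}(\mu_2, \mu_3)$$
$$= \frac{4}{n\mu_3^2}(\mu_6 - \mu_3^2 - 6\mu_4\mu_2 + 9\mu_2^3) + \frac{9}{n\mu_2^2}(\mu_4 - \mu_2^2) - \frac{12}{n\mu_2\mu_3}(\mu_5 - 4\mu_2\mu_3)$$

In terms of $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ (see page 159, footnote, for definition of the
higher $\beta$'s),

$$\operatorname{var}\beta_1 = \frac{\beta_1}{n}\{4\beta_4 - 24\beta_2 + 36 + 9\beta_1\beta_2 - 12\beta_3 + 35\beta_1\} \qquad (19.25)$$

Similarly,

$$\operatorname{var}\beta_2 = \frac{1}{n}\{\beta_6 - 4\beta_2\beta_4 + 4\beta_2^3 - \beta_2^2 + 16\beta_2\beta_1 - 8\beta_3 + 16\beta_1\} \qquad (19.26)$$

The labour of evaluating these quantities may be obviated by the use
of tables given in Tables for Statisticians and Biometricians, Part I.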
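Where the tables are not to hand, the variances of β_1 and β_2 can be evaluated directly. The sketch below rests on two assumptions that should be checked against the footnote referred to above: that the higher β's follow Pearson's definitions β_3 = μ_3μ_5/μ_2⁴, β_4 = μ_6/μ_2³, β_6 = μ_8/μ_2⁴, and that the variance of β_2 takes the form obtained by applying the method of this section to β_2 = μ_4/μ_2² with (19.6) and (19.7). The function names are illustrative.

```python
def var_beta1(b1, b2, b3, b4, n):
    """Large-sample variance of beta_1, eq. (19.25)."""
    return b1 / n * (4 * b4 - 24 * b2 + 36 + 9 * b1 * b2 - 12 * b3 + 35 * b1)


def var_beta2(b1, b2, b3, b4, b6, n):
    """Large-sample variance of beta_2 in the same notation (assumed form)."""
    return (b6 - 4 * b2 * b4 + 4 * b2 ** 3 - b2 ** 2
            + 16 * b2 * b1 - 8 * b3 + 16 * b1) / n


# Normal parent: beta_1 = beta_3 = 0, beta_2 = 3, beta_4 = 15, beta_6 = 105.
n = 1000
print(var_beta1(0, 3, 0, 15, n), var_beta2(0, 3, 0, 15, 105, n))  # 0.0 0.024
```

For a normal parent the second value is 24/n, while the first vanishes; the vanishing value needs to be read with caution when β_1 is itself zero.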

19.20 There is here one important point to be noted. In equation
(19.24), if V = 0, $\sigma_V = 0$. Similarly, in equation (19.25), if $\beta_1 = 0$, $\sigma_{\beta_1} = 0$.

It might be thought from this that if in a large sample we find in the one
case that V = 0 (and hence that $\sigma$ = 0), or in the other case that the distri-
bution is symmetrical, then V = 0 or $\beta_1 = 0$ in the population. This is not
necessarily true.

V will vanish only if all members of the sample give the same value 
of the variate. If the sample is large, it will be evident that if there is 
any variation in the parent it must be small ; but it is not impossible 
that members should exist showing deviations from the observed value. 
The explanation is to be found in the terms which we have neglected
in our approximations. These, though in general small compared with
the terms retained, may be important if the terms retained themselves
vanish. Furthermore, our assumption that the sample value is the same
as the parent value may be unjustified if both are very small compared 
with their difference. Equations such as (19.24) and (19.25) must, there- 
fore, be treated carefully in the neighbourhood of values which cause them 
to vanish. 

19.21 From the foregoing work the student will have no difficulty in
accepting the statement that it is possible to calculate the standard 
error of any quantity which is expressible as a function of the moments. 
Such a standard error would, however, be applicable only to a value 
which had actually been calculated from the moments, and not arrived 
at by some other means. We shall not pursue the subject further in this 
book, but we may point out that the standard errors of certain quantities, 
such as an approximation to the Pearson measure of skewness (7.12), have 
been tabulated in Tables for Statisticians and Biometricians for different 
values of $\beta_1$ and $\beta_2$. The same tables also contain some results of interest
in connection with the sampling distributions of range. 

We now turn to the parameters of multivariate universes, the correla- 
tion coefficients, regression coefficients, and some of the measures of 
association. 


Standard error of the correlation coefficient 


19.22 For samples from a normal population the standard error of the
correlation coefficient is given by

$$\sigma_r = \frac{1 - r^2}{\sqrt{n}} \qquad (19.27)$$


A proof of this result would take us beyond the scope of the present 
work. It has to be used with reserve for values of the correlation near 
to unity, since the distribution in such a case is markedly skew unless 
the sample is very large, say, at least 500. When there is any doubt it 
is better to use an alternative test given in 21.33. 

The formula applies also to partial correlations. 

19.23 Formula (19.27) is sometimes used to estimate the precision of 
correlation coefficients obtained by the use of the product-moment formula 
without reference to the nature of the population. This practice is 
hardly to be commended, although sometimes there is nothing better 
to do. It is, however, possible to generalise the procedure of sections 
19.2 to 19.8 to the bivariate case, and it may be shown that 


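$$\sigma_r^2 = \frac{r^2}{n}\left\{\frac{\mu_{22}}{\mu_{11}^2} + \frac{\mu_{40}}{4\mu_{20}^2} + \frac{\mu_{04}}{4\mu_{02}^2} + \frac{\mu_{22}}{2\mu_{20}\mu_{02}} - \frac{\mu_{31}}{\mu_{11}\mu_{20}} - \frac{\mu_{13}}{\mu_{11}\mu_{02}}\right\} \qquad (19.28)$$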


(For the definition of the bivariate moments, see footnote, page 22^.
In addition, if the regression is linear, denoting the $\beta_2$ of the two
variates considered separately by $\beta_2$, $\beta_2'$,


(19.29)

which reduces to (19.27) if the kurtosis is zero.

If the distribution is not normal and r is not small, the difference between
the values given by (19.27) and (19.29) may be considerable; but it may
be noticed that the value given by (19.27) is less than that given by (19.29)
if the distribution is platykurtic for both variates, and greater if the
distribution is leptokurtic for both variates.

19.24 In particular, it may be shown that for a 2x2 table in which 
the frequencies are (AB), (Aβ), (αB) and (αβ), the standard error of the
correlation coefficient calculated by the product-moment method on the 
assumption that the frequencies are concentrated at points is given by 


n 


1 + 





(A)(a) 




(19.30) 


19.25 The standard error of tetrachoric r, as calculated in the manner 
of 11.32, is given by very complicated expressions which we do not 
reproduce. The coefficient is very sensitive to departures of the parent from 
normality, and no satisfactory test of significance seems to be known. 

Example 19.6. — In the data of Table 9.3, page 202, we found that 
the correlation between the stature of the father and the stature of the
son was 0.51. Regarding these data as a sample of 1078 from the popula-
tion of fathers and sons, we have:

$$\text{Standard error of } r = \frac{1 - r^2}{\sqrt{n}} = \frac{1 - (0.51)^2}{\sqrt{1078}} = 0.023 \text{ approximately}$$

Hence, if the sampling was simple, the correlation in the population
most probably lies within 0.44 and 0.58. It is thus undoubtedly real.

Example 19.7. — In considering data from 14,416 cows, J. F. Tocher 
found a negative correlation of 0.0796 between yield of milk per week and
percentage of butter fat. Is this significant, i.e., could it have arisen from
an uncorrelated population by sampling fluctuations?

If r = 0,

$$\sigma_r = \frac{1}{\sqrt{14{,}416}} = 0.0083 \text{ approximately}$$

The correlation observed is nearly ten times this, and small though it is,
could not have arisen from sampling fluctuations.

In this example we may reiterate the caution to be observed in inferring 
from the sample anything about the population (cows in Scotland) as 
a whole. The records were, in fact, taken by the Scottish Milk Records 
Association from constituent associations at various years between 1908 
and 1923. The conditions of simple sampling may, therefore, have been 
violated both in regard to time and in regard to place. 
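The calculations of Examples 19.6 and 19.7 both rest on the normal-theory standard error (1 - r²)/√n of (19.27). A minimal sketch, with the function name chosen for illustration:

```python
import math


def se_r_normal(r, n):
    """Standard error of the correlation coefficient, normal parent, eq. (19.27)."""
    return (1.0 - r ** 2) / math.sqrt(n)


# Example 19.6: r = 0.51 from 1078 pairs.
print(f"{se_r_normal(0.51, 1078):.3f}")          # 0.023

# Example 19.7: observed r = -0.0796 from 14,416 cows, tested against r = 0.
se0 = se_r_normal(0.0, 14416)
print(f"{se0:.4f} {abs(-0.0796) / se0:.1f}")     # 0.0083 9.6
```

In the second case the standard error is evaluated at r = 0 because the hypothesis under test is that the population is uncorrelated.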

Standard error of the coefficient of regression

19.26 The standard error of the coefficient of regression $b_{12}$ from a normal
population is given by

$$\sigma_{b_{12}} = \frac{\sigma_1\sqrt{1 - r^2}}{\sigma_2\sqrt{n}} = \frac{\sigma_{1.2}}{\sigma_2\sqrt{n}} \qquad (19.31)$$

This again applies to a regression coefficient of any order, total or
partial, i.e., in terms of our general notation, k denoting any collection of
secondary subscripts other than 1 or 2,

$$\text{Standard error of } b_{12.k} \text{ for a normal distribution} = \frac{\sigma_{1.2k}}{\sigma_{2.k}\sqrt{n}}$$


The correlation ratio and coefficient of multiple correlation 
19.27 It has been shown that the sampling distributions of the correlation 
ratio and the multiple correlation coefficient from normal populations 
do not tend to the normal form for large samples, although they do give 
single-humped distributions. The use of a standard error in such cases 
must be made with great caution, and it is probably better to apply 
one of the tests of significance which we shall consider later in connection 
with the theory of small samples. The formula usually given for the 
standard error of the correlation ratio is an approximate one — 


(19.32)

19.28 Somewhat similar remarks apply to the coefficient $\eta^2 - r^2$
which, as we saw in 11.8, may be used to test the linearity of regression.
The use of a standard error for this coefficient in an attempt to gauge the significance of
a departure from linearity has been subjected to very damaging criticism.

Example 19.8.— Consider the data of Example 12.2, page 293 (relation 
between pauperism, age of population and number of population). 

We found — 

$$x_1 = 0.325x_2 + 1.383x_3 - 0.383x_4$$

Taking this to be given by a random sample from a normal population,
is the value 0.325 significant?

We have

$$\sigma_{b_{12.34}} = \frac{\sigma_{1.234}}{\sigma_{2.34}\sqrt{n}} = \frac{22.8\sqrt{1 - 0.457^2}}{32.1\sqrt{32}} = 0.11$$

The coefficient is therefore significant.

In this example the number in the sample is not as large as one could
wish and the standard error is probably underestimated; but if any
doubt exists it is possible to make more definite tests by the methods of
Chapter 21.

Standard error of coefficient of association 

19.29 We may refer briefly to the quantities treated in Chapters 2 and 3, 
in considering the association of attributes. 

The coefficient of association, Q, defined in 2.15, has a standard error
given by

$$\sigma_Q = \frac{1 - Q^2}{2}\sqrt{\frac{1}{(AB)} + \frac{1}{(A\beta)} + \frac{1}{(\alpha B)} + \frac{1}{(\alpha\beta)}} \qquad (19.33)$$

This quantity is not infinite, as might at first sight appear, if one of
the cell frequencies vanishes, because in that case $1 - Q^2$ also vanishes; in
fact, in such an event $\sigma_Q = 0$.
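A short sketch of this calculation for a fourfold table follows. It assumes the usual definition Q = {(AB)(αβ) - (Aβ)(αB)} / {(AB)(αβ) + (Aβ)(αB)} and the standard error ((1 - Q²)/2)√{1/(AB) + 1/(Aβ) + 1/(αB) + 1/(αβ)}; the function name and the cell frequencies are hypothetical. The case of a vanishing cell is handled separately, returning the limiting value zero noted above, and the routine presumes that the two cross-products are not both zero.

```python
import math


def q_and_se(ab, a_beta, alpha_b, alpha_beta):
    """Coefficient of association Q and its standard error for a 2x2 table.

    Arguments are the four cell frequencies (AB), (A beta), (alpha B), (alpha beta).
    """
    cross1 = ab * alpha_beta
    cross2 = a_beta * alpha_b
    q = (cross1 - cross2) / (cross1 + cross2)
    if 0 in (ab, a_beta, alpha_b, alpha_beta):
        return q, 0.0   # limiting value when a cell frequency vanishes
    root = math.sqrt(1 / ab + 1 / a_beta + 1 / alpha_b + 1 / alpha_beta)
    return q, (1.0 - q ** 2) / 2.0 * root


# Hypothetical table, for illustration only.
print(q_and_se(40, 10, 20, 30))
```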

Standard error of the coefficient of mean-square contingency 

19.30 The determination of the standard error of the coefficient of 
mean-square contingency is a matter of considerable mathematical com- 
plexity, and even when approximations are employed, leads to expressions 
which are tedious to calculate, in practice. For a detailed discussion we 
must refer the student to the original memoirs (K. Pearson, Biometrika, 
1913, 9, 22 and T. Kondo, Biometrika, 1929, 21, 376).

Spearman's rank correlation coefficient

19.31 Unlike most of the parameters we have been considering, the 
distribution of Spearman’s rank correlation coefficient is discontinuous, 
and to that extent resembles the binomial. Very little is known about the 
distribution except in the important case when the correlation in the 
population is zero. The other cases are sometimes treated by assuming a 
normal continuous distribution in the parent and working from ranks to 
grades and thence to the product-moment coefficient of correlation by
the equations (11.21) and (11.23) of 11.29; but this procedure is not
to be recommended.

The case when the correlation in the population is zero, i.e., when all
possible permutations of the ranks occur with equal frequency, has to some
extent been investigated. It was shown by "Student" in 1907 that the
standard deviation of Spearman's rank correlation coefficient is given
by the simple equation

$$\sigma_\rho = \frac{1}{\sqrt{n - 1}} \qquad (19.34)$$

This cannot be taken to be a standard error in the ordinary way,
because the distribution is not normal for small samples. It has also
been shown that the distribution tends to normality as n increases, but
for low values of n the normal distribution gives an unsatisfactory approxi-
mation. For values of n greater than 8 the significance of an observed $\rho$
can be tested in the t-distribution (see below, 21.25) by entering the tables
with $t = \rho\sqrt{n - 2}/\sqrt{1 - \rho^2}$ and $\nu = n - 2$.

The rank correlation coefficient $\tau$

19.32 For the coefficient $\tau$ more information is available. Kendall
(Advanced Theory of Statistics, Vol. 1, chapter 16) has given the actual
distribution up to and including n = 10 in the case where all possible
rankings occur equally frequently, and has shown that the distribution
tends to normality more rapidly than that of $\rho$. For values greater than
n = 10 the distribution can be assumed to be normal with a standard
error given by

$$\sigma_\tau^2 = \frac{2(2n + 5)}{9n(n - 1)} \qquad (19.35)$$


19.33 Tests of $\rho$ or $\tau$ based on the results given in the two preceding
sections take as the hypothesis that there is no correlation in the popula-
tion. For instance, suppose a value of $\tau$ in a ranking of 15 was found
to be 0.6. For the standard error we find, from (19.35), a value of 0.19.
The observed value exceeds thrice this amount and is significant. Our
argument is as follows:

If there were no correlation in the population from which this ranking
is supposed to have been drawn as a sample, the order of appearance
of one variate is just as likely as any other order. Consequently, in
continued sampling we should, in the long run, obtain all possible rankings
of one variate with any particular ranking of the other. The population
of values of $\tau$ so generated has a standard deviation given by (19.35).
Our observed value is very improbable in relation to this distribution, and
hence we suspect the hypothesis that the variates are independent.
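The figures of this illustration can be checked with a few lines. The sketch assumes the form 2(2n + 5) / {9n(n - 1)} for the variance of the coefficient when all rankings are equally likely, which is what (19.35) gives; the function name is an illustrative choice.

```python
import math


def se_tau_null(n):
    """Standard error of the rank coefficient when all rankings are equally likely."""
    return math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))


n, tau = 15, 0.6
se = se_tau_null(n)
print(f"{se:.2f} {tau / se:.2f}")   # 0.19 3.12
```

The observed value is thus a little over three times its standard error, which is the basis of the judgment of significance given above.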

19.34 But we have said nothing about the case when the variates are not
independent in the population, and the foregoing results cannot be used
to test the difference of two rank correlation coefficients. Nothing appears
to be known on this point in relation to $\rho$, but some light has been thrown
on it in regard to $\tau$. In fact it may be shown:

(a) That the observed value of $\tau$ is a good estimate of the value in the
parent population;

(b) That the standard error of $\tau$ is not greater than $\sqrt{\frac{2}{n}(1 - \tau^2)}$.

This limit is in some cases nearly reached so that no lower limit appears
possible. The test based on it may be rather insensitive but it seems
unlikely that any improvement can be effected unless some further
assumption is made about the nature of the parent population. (For
the further theory of this subject see Kendall's Rank Correlation Methods,
1948, Griffin).


SUMMARY 


1. The following are the standard errors of the parameters named, the 
parent population being assumed normal — 


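    Variance                    $\sigma^2\sqrt{2/n}$
    Standard deviation          $\sigma/\sqrt{2n}$
    Coefficient of variation    $\frac{V}{\sqrt{2n}}\{1 + 2V^2/10^4\}^{1/2}$
    Correlation coefficient     $(1 - r^2)/\sqrt{n}$
    Regression coefficient      $\sigma_{1.2}/(\sigma_2\sqrt{n})$

2. The standard error of the qth moment measured about the mean is
$\sigma_{\mu_q}$, given by

$$\sigma_{\mu_q}^2 = \frac{1}{n}\{\mu_{2q} - \mu_q^2 + q^2\mu_2\mu_{q-1}^2 - 2q\mu_{q-1}\mu_{q+1}\}$$

3. The correlation between errors in the qth and rth moments, both
measured about the mean, is given by

$$\operatorname{cov}(\mu_q, \mu_r) = \frac{1}{n}\{\mu_{q+r} - \mu_q\mu_r + qr\mu_2\mu_{q-1}\mu_{r-1} - r\mu_{r-1}\mu_{q+1} - q\mu_{q-1}\mu_{r+1}\}$$
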
4. From the results of (2) and (3), and similar results for moments
about a fixed point, it is possible to calculate the standard error of any
function of the moments.

5. In the normal population, errors in the mean and standard deviation
are uncorrelated.

6. In calculating the standard errors of moments the uncorrected
values should be used.

7. It is unsafe to use the formulae for standard errors appropriate to the
normal population in cases where the population is suspected to differ from
the normal form; in particular, the formula for the standard error of the
standard deviation, $\sigma/\sqrt{2n}$, should not be used for parent populations which
are markedly lepto- or platy-kurtic.

8. Tests are given for the significance of the rank correlation coefficients
$\rho$ and $\tau$ when no parental correlation exists. When there is parent correla-
tion an upper limit to the standard error of $\tau$ is given by

$$\sqrt{\frac{2}{n}(1 - \tau^2)}$$


EXERCISES 

19.1 In the weight distribution of Exercise 4.6, page 100, last column, 
find the standard error of the standard deviation. Compare it with 
the value obtained on the assumption that the parent distribution is 
normal. 

19.2 In the same data, compare the ratio of the s.e. of the s.d. to the s.d. 
with the ratio of the s.e. of the semi-interquartile range to the semi-inter- 
quartile range. 

19.3 Show that for a normal population the standard error of the s.d. is 
less than the standard error of the semi-interquartile range. 

19.4 In a sample of 1,000 the mean is found to be 17.5 and the standard
deviation 2.5. In another sample of 800 the mean is 18 and the standard
deviation 2.7. Assuming that the samples are independent, discuss
whether the two samples can have come from populations which have
the same standard deviation. 

19.5 Find the correlation between errors in the mean and standard devia- 
tion for the height distribution of 8585 men of Table 4.7, page 82, and do 
the same for the marriage distribution of Table 4.8, page 84. 

19.6 Find the standard errors of the first four cumulants as calculated 
from the moments. 

19.7 Samples of 10,000 are taken from a normal population. For what 
even moments does the standard error of the moment lie within 10 per 
cent of the value of that moment?

19.8 For samples of (a) 100, (b) 1,000, draw a graph showing how the
standard error of the correlation coefficient from a normal population 
varies with r. 

19.9 (Data quoted by M. F. Hoadley, "Note on the Association of
Relative Laterality of Hand and Eye from the Cambridge Anthropometric
Data," Biometrika, 1928, 20B, 401.)

Three experiments were conducted to determine the relationship between
laterality of hand and laterality of eye. The correlations between (1)
difference of strength of grip and (2) difference in visual acuity were:

    -0.02410 (3234 subjects)
    -0.00738 (4003 subjects)
    +0.02962 (1447 subjects)

Find the standard errors of the three correlation coefficients, and hence 
show that it cannot be concluded that there is any significant correlation 
between laterality of hand and laterality of eye. 

19.10 Find the standard errors of the partial correlation coefficients of 
Example 12.1, page 290. Hence state whether any one is not significantly 
different from zero, and if so, which. For the purpose of this exercise 
normality may be assumed, although in all probability the actual data 
do not emanate from a normal population. 



CHAPTER TWENTY


THE $\chi^2$ DISTRIBUTION


20.1 In Chapters 17 to 19 we have seen that a knowledge of the sampling 
distribution of a statistic gives us a means of judging from samples
the relationship between fact and theory. For instance, in Example 17.3, 
page 389, we were able to infer from a knowledge of the binomial distribu- 
tion that the dice which provided the data were probably biased ; and 
in Example 18.6, page 428, we could apply a knowledge of the distribution 
of the mean of samples from a normal population to reject the hypothesis 
that the mean in the population was less than 67 inches. 

In the present chapter we shall discuss a particular sampling distribution 
of profound importance in statistical theory, and shall note its applications 
to the testing of accordance between fact and hypothesis in a wide range 
of cases. 

Cells

20.2 In what follows we shall consider only data giving the frequencies
of individuals falling within various categories. Statistical data, as will
have been evident from the examples already given in this book, are very 
often of this type. 

Such data, whether relating to attributes or to continuous variates 
or to a mixture of both, will in practice be arranged in compartments. 
For example, in the association table on page 20 there are four com- 
partments, corresponding to the four ultimate classes. In the table of 
frequencies within various height ranges (Table 4.7, page 82), each range
determines a compartment, and the data consist of 8585 individuals
distributed in 21 groups. 

It is convenient to have a name for these compartments. We shall 
call them cells. The frequency falling in a cell will be referred to as the 
cell frequency. 

One and the same table may contain frequencies of more than one
order, and frequencies of different orders must be kept distinct. Thus
an association table has four cells with frequencies of the second order
and two sets of two (the border frequencies) of the first order. A p×q
contingency table has pq cells of the second order (to condense our
terminology) and a set of p and a set of q of the first order. Each such set
must be considered by itself. The tests of this chapter are applicable
to any homogeneous set, but not to a "mixed" set comprising cells of
different orders.

20.3 We shall denote the number of cells in the presentation of a set
of data by n, and the observed cell frequency occurring in the rth cell
by m̃_r. Thus, in the table of page 82 we have, numbering the cells
downwards:

m̃_1 = 2
m̃_2 = 4
m̃_3 = 14
. . .
m̃_21 = 2

20.4 In the class of cases we shall consider, we wish to compare the
actual values m̃ with the cell frequencies which would exist if a particular
hypothesis H were exactly verified. These latter values we shall denote
by the letter m, so that the theoretical frequency in the rth cell is m_r.

The cell frequencies m_r are sometimes referred to as the "expected"
values on the hypothesis H. This is rather a special use of the word
"expected," in the sense we have already given, namely, that the m's
assume the values which they would take if the hypothesis were exactly
verified for the particular set of data.

We shall write

$$x_r = \tilde{m}_r - m_r \tag{20.1}$$

so that the x's are the excesses of the actual over the expected frequencies.

Clearly the quantities x embody all the information in the data about
the discrepancies between theory and fact. If the x's are all zero, fact
and theory are in perfect agreement. If the x's are large, the agreement
is poor.

Example 20.1. As a simple example let us consider the 2×2 con-
tingency table of Example 2.5, page 25. Numbering the cells from left
to right we have the observed frequencies

m̃_1 = 276, m̃_2 = 3

m̃_3 = 473, m̃_4 = 66

Now let our hypothesis H be that inoculation and exemption from attack
are independent. If this be so, the expected frequencies are

m_1 = 255.5, m_2 = 23.5

m_3 = 493.5, m_4 = 45.5

and hence we have

x_1 = m̃_1 - m_1 = 20.5, x_2 = -20.5

x_3 = -20.5, x_4 = 20.5
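The arithmetic of this example can be checked with a short sketch. It assumes the usual independence rule (row total times column total, divided by the grand total) for the expected frequencies; the function name `expected_under_independence` is illustrative and does not come from the text.

```python
def expected_under_independence(table):
    """Expected cell frequencies if the row and column attributes are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed frequencies of Example 20.1, cells numbered from left to right.
observed = [[276, 3], [473, 66]]
expected = expected_under_independence(observed)

# Excess of the actual over the expected frequency in each cell.
excesses = [[o - e for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(observed, expected)]

for obs_row, exp_row, x_row in zip(observed, expected, excesses):
    for o, e, x in zip(obs_row, exp_row, x_row):
        print(f"observed {o:4d}   expected {e:7.1f}   excess {x:+6.1f}")
```

The four excesses come out equal in magnitude and alternate in sign, which is the point taken up in the next section.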

The x's are, in fact, in this particular case, the numbers we referred to in
Chapter 2 as δ-numbers. We have already considered them as reflecting
the divergence of fact from theory.

Constraints

20.5 In the example we have just considered, one important effect is to
be noted, viz. that when we have calculated one independent frequency,
say m_1, the other three follow arithmetically from the fact that the two
frequencies in any row or column must add up to the border frequency
in that row or column.

In fact, we have

$$x_1 + x_2 = 0, \qquad x_3 + x_4 = 0, \qquad x_1 + x_3 = 0 \tag{20.2}$$

We need not add x_2 + x_4 = 0, since this is given by the last two equations
in conjunction with the first. There are only three independent equations.

Thus, whatever our hypothesis H may be, the conditions of the problem 
impose limitations, expressed by the equations (20.2), on the way in which 
the m’s and the x's may be chosen. If one m or one x is fixed by H, the 
other three are determinate in accordance with the conditions of the 
data themselves.

Similarly, suppose we wished to examine the height data of page 82 
in the light of the hypothesis that the parent distribution, of which this 
is a sample, is normal with given mean and standard deviation. With 
the aid of the table of the probability integral we can determine the cell 
frequencies on this hypothesis ; but again the problem imposes a limitation
on the way in which the theoretical cell frequencies are assigned,
namely, that they must add up to the total number 8585 of the sample. 
When 20 frequencies are fixed, the other is determined by mere arithmetic. 

20.6 In general, when the conditions of the problem impose limitations 
of this kind on the number of cell frequencies which may be fixed by H 
we say, borrowing an expression from Statics, that they impose constraints.
In the example of the 2×2 contingency table there were three independent
constraints, expressed by the equations (20.2). In the case of the height
distribution there is one constraint expressed by the fact that the sum 
of the cell frequencies must be 8585. 

Linear constraints 

20.7 Constraints which involve linear equations in the cell frequencies 
(i.e. equations containing no squares or higher powers of the frequencies) 
are called linear constraints. The two instances above are of this type. 
Linear constraints are of paramount importance, and we shall shortly 
confine our attention to them alone. 

Degrees of freedom 

20.8 We denote the number of independent constraints in a set of data
by κ. We then define the number ν by the simple equation

$$\nu = n - \kappa$$

and call ν the number of degrees of freedom of the aggregate of cells. It
is the number of cell frequencies which can be assigned at will, the
remaining κ following from the conditions to which the data are subject.

Thus, for the 2×2 table κ = 3 and ν = 1, for, as we have seen, the fixing
of one cell frequency fixes them all. For the height distribution κ = 1
and ν = 20.

Example 20.2. Let us find the number of degrees of freedom of a
p×q contingency table.

The constraints of such a table are similar to those of the 2×2 table.
Thus the sum of the cell frequencies in each row is determined as being
the border frequency in that row, and similarly for the columns. Hence
each of the p columns and q rows imposes a constraint. From the total
p+q constraints we must, however, subtract one, for they are not
algebraically independent ; there is one relation between them, expressed
by the fact that the sum of the border column equals the sum of the
border row, namely, the total frequency N.

Hence there are p+q-1 independent linear constraints. Hence,

$$\nu = pq - (p + q - 1) = (p - 1)(q - 1)$$

We might have got this result more directly by considering that the
cell frequencies in the first p-1 columns and q-1 rows are determinable
at will, the rest following automatically from the border frequencies.
Hence the number of degrees of freedom, being the number of cells which
can be so filled, is (p-1)(q-1) as before.
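The counting rule ν = n - κ and its special case for contingency tables are simple enough to express as two small functions. This is only a sketch; the function names are illustrative.

```python
def degrees_of_freedom(n_cells, n_independent_constraints):
    """nu = n - kappa: cells that can be filled at will."""
    return n_cells - n_independent_constraints

def contingency_degrees_of_freedom(p, q):
    """A p x q table has pq cells and p + q - 1 independent linear constraints."""
    return degrees_of_freedom(p * q, p + q - 1)

# The 2x2 table: four cells, three independent constraints.
print(contingency_degrees_of_freedom(2, 2))

# The height distribution: 21 cells, one constraint (the total of 8585).
print(degrees_of_freedom(21, 1))
```

For any p and q the second function agrees with (p-1)(q-1), as the algebra above shows.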

20.9 Now let us consider a set of data arranged in n cells, the total
frequency being N.

The theoretical frequency in the rth cell is m_r. This means that the
chance of an individual falling into this cell is m_r/N and the chance of its
not doing so is 1 - m_r/N. We may regard the actual frequencies m̃ as
having been arrived at by distributing the N individuals among the
n cells in such a way that the chance of an individual falling into the
rth cell is m_r/N. Hence the probability that of the N individuals, m̃_r fall
into the rth cell and the remainder elsewhere is the term involving
(m_r/N)^{m̃_r} in the binomial

$$\left\{\left(1 - \frac{m_r}{N}\right) + \frac{m_r}{N}\right\}^{N}$$

Thus, this binomial will give us the relative frequencies of the various
values which m̃_r can take in different samples, of which the actual data
form one.

If N is fairly large and m_r/N is not small, this distribution is approxi-
mately normal with mean m_r. That is to say, m̃_r is distributed normally
about a mean m_r, or x_r is distributed normally about zero mean.


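Definition of χ²

20.10 We now define the quantity χ² by the equation

$$\chi^2 = \sum \frac{x_r^2}{m_r} = \sum \frac{(\tilde{m}_r - m_r)^2}{m_r} \tag{20.3}$$

the summation being taken over the n cells.

The student can verify for himself that this definition is consistent
with that given in equation (3.4), page 52, for the particular case of
divergence from independence in a contingency table.

We can write χ² in a slightly different form. For

$$\chi^2 = \sum \frac{(\tilde{m}_r - m_r)^2}{m_r} = \sum \frac{\tilde{m}_r^2}{m_r} - 2\sum \tilde{m}_r + \sum m_r = \sum \frac{\tilde{m}_r^2}{m_r} - N \tag{20.4}$$

since the observed and the theoretical frequencies each add up to N.
This corresponds to equation (3.7), page 53.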
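As a numerical check that the two forms (20.3) and (20.4) agree, the sketch below evaluates both on the 2×2 table of Example 20.1. The expected frequencies are computed from the border totals rather than taken from the rounded figures, so that they add up to N exactly; the variable names are illustrative.

```python
# Observed frequencies of Example 20.1, cells numbered from left to right.
observed = [276, 3, 473, 66]
n_total = sum(observed)

# Border totals of the 2x2 table.
row_totals = [observed[0] + observed[1], observed[2] + observed[3]]
col_totals = [observed[0] + observed[2], observed[1] + observed[3]]

# Theoretical frequencies on the hypothesis of independence.
expected = [r * c / n_total for r in row_totals for c in col_totals]

# Form (20.3): sum of squared excesses over theoretical frequencies.
chi2_from_excesses = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Form (20.4): sum of squared observed over theoretical, less N.
chi2_short_form = sum(o ** 2 / e for o, e in zip(observed, expected)) - n_total

print(chi2_from_excesses, chi2_short_form)
```

The agreement depends on the theoretical frequencies adding up to N; with rounded values of m the two forms can differ slightly.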


20.11 If χ² = 0 all the x's are zero, and hence the actual cell frequencies
coincide with the expected cell frequencies. On the other hand, if some
or all of the x's are large, χ² will be large.

It will thus be evident that χ² affords a measure of the correspondence
between fact and theory. It must not be forgotten, however, that it
ignores the signs of the x's and hence takes no cognisance of certain
information which those signs may convey. We shall take up this point
again later.


20.12 If the use of χ² is to be satisfactory, we must be able to dis-
tinguish significant values from those which may have arisen by sampling
fluctuations. This leads us to inquire what is the probability of getting
a particular value of χ² from a set of m̃'s chosen at random, and this in
turn leads to the question : What is the sampling distribution of χ² ?

We shall not give a proof here of the important answer to this question,
but shall content ourselves with quoting it and indicating briefly the
method by which it is obtained.

We have already seen that the sum of n normally distributed variates
is itself normally distributed (10.8). The sum of the squares of n normal
variates is not so distributed, however. In fact, the sum of the squares
of n normal variates, drawn from a population with unit standard devia-
tion and zero mean, is distributed in a form given by the equation

$$y = y_0 \, e^{-\chi^2/2} \, \chi^{n-1} \tag{20.5}$$

where χ² is the sum in question.

Now it has already been shown that under the conditions assumed
the x's are each distributed normally about zero mean, and it may be
shown further that χ² may be regarded as the sum of the squares of ν
variates each distributed normally with unit s.d. and about a zero mean.
Hence the distribution of χ² is given by*

$$y = y_0 \, e^{-\chi^2/2} \, \chi^{\nu-1} \tag{20.6}$$


20.13 It follows, as in 18.8, that if we take a random set of m̃'s and
calculate χ² from them, the probability of getting a value of χ² as great
as, or greater than, the observed value χ₀² is the area of the curve (20.6)
to the right of the ordinate at χ₀ divided by the total area of the curve ;
or, in the language of the integral calculus,†

$$P = \frac{\int_{\chi_0}^{\infty} e^{-\chi^2/2} \, \chi^{\nu-1} \, d\chi}{\int_{0}^{\infty} e^{-\chi^2/2} \, \chi^{\nu-1} \, d\chi} \tag{20.7}$$

The curve, as we shall see later, extends from 0 to +∞, which accounts
for the limits of the integral in the denominator of the above expression.


* Since the variate in this expression is χ, the distribution should, perhaps, be known
as the χ-distribution, not the χ²-distribution. The latter name is, however, in universal
use, and the tables of the integral of equation (20.7) are usually prepared with argu-
ment χ².

† The actual values of P are, expanding this integral,

$$P = \sqrt{\frac{2}{\pi}} \int_{\chi}^{\infty} e^{-t^2/2} \, dt + \sqrt{\frac{2}{\pi}} \, e^{-\chi^2/2} \left( \frac{\chi}{1} + \frac{\chi^3}{1 \cdot 3} + \frac{\chi^5}{1 \cdot 3 \cdot 5} + \dots + \frac{\chi^{\nu-2}}{1 \cdot 3 \cdot 5 \cdots (\nu-2)} \right) \quad \text{if } \nu \text{ is odd}$$

$$P = e^{-\chi^2/2} \left( 1 + \frac{\chi^2}{2} + \frac{\chi^4}{2 \cdot 4} + \frac{\chi^6}{2 \cdot 4 \cdot 6} + \dots + \frac{\chi^{\nu-2}}{2 \cdot 4 \cdot 6 \cdots (\nu-2)} \right) \quad \text{if } \nu \text{ is even}$$

The first term of the first series may be obtained from the probability integral.
Values of P for given χ² and ν are provided in Tables for Statisticians and Biometricians,
a new edition of which, in course of preparation, gives more detailed tables than have
hitherto been available.

Tabulation of P for the χ² distribution

20.14 The rather formidable result of equation (20.7) need occasion
no alarm to the student who is unacquainted with the notation and methods
of the integral calculus. The function P has been tabulated for certain
ranges of ν and χ² in the same way as the probability integral for the normal
curve, and the tables are in most cases sufficient for the practical applica-
tion of the results of the present chapter. More convenient is the table
given in Appendix Table 3, which shows the values of χ² for given values
of ν and P.
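Where no table is to hand, P can be computed directly. The sketch below assumes the standard finite-series expansions of the integral (20.7), one for odd and one for even ν, with the normal-integral term obtained from the library function `math.erfc`. The function name `chi2_tail_probability` is illustrative.

```python
import math

def chi2_tail_probability(chi2, nu):
    """Probability of a value of chi-square as great as or greater than chi2."""
    if chi2 < 0 or nu < 1:
        raise ValueError("chi2 must be non-negative and nu a positive integer")
    chi = math.sqrt(chi2)
    series = 0.0
    if nu % 2 == 0:
        # 1 + chi^2/2 + chi^4/(2.4) + ... + chi^(nu-2)/(2.4...(nu-2))
        term, k = 1.0, 0
        while k <= nu - 2:
            series += term
            k += 2
            term *= chi2 / k
        return math.exp(-chi2 / 2) * series
    # chi/1 + chi^3/(1.3) + ... + chi^(nu-2)/(1.3.5...(nu-2)); empty when nu = 1
    term, k = chi, 1
    while k <= nu - 2:
        series += term
        k += 2
        term *= chi2 / k
    both_tails_of_normal = math.erfc(chi / math.sqrt(2))
    return both_tails_of_normal + math.sqrt(2 / math.pi) * math.exp(-chi2 / 2) * series

# Values quoted in this chapter, for comparison with the printed tables.
for chi2, nu in [(3.841, 1), (6.346, 7), (23.209, 10), (30, 10), (40, 10)]:
    print(f"chi2 = {chi2:7.3f}  nu = {nu:2d}  P = {chi2_tail_probability(chi2, nu):.6f}")
```

The pairs of χ² and ν in the loop are those quoted from Appendix Table 3 and from the Tables for Statisticians and Biometricians in the examples of this chapter.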

20.15 It is desirable to point out that other writers have used different 
letters to denote the number of degrees of freedom. Karl Pearson, in 
the tables to which we have just referred, used the number n', which is
one more than our ν. R. A. Fisher writes n instead of our ν, so that we
have

ν = n' - 1 (Pearson) = n (Fisher)

We have thought it desirable to introduce the symbol ν in order to avoid
confusion with the use of n' and n as numbers in a sample or in a popula- 
tion. 

The χ² test of significance when the theoretical cell frequencies are known
a priori

20.16 Armed with Appendix Table 3, we can now proceed as follows:

Having decided on the hypothesis to be tested, we calculate from it
the theoretical frequencies m_r. (For the present we assume that this can
be done without reference to the observed frequencies m̃_r. The contrary
case will be considered later.)

From the m̃'s and the m's we calculate χ² according to (20.3) or (20.4).
We also ascertain ν.

Then, from the table we determine whereabouts this value of χ² lies in
relation to P.

The value P gives us the probability that on random sampling we should
get a value of χ² as great as, or greater than, the value actually obtained.

Now, if P is small, our data give us an improbable value of χ². Thus
we have the alternative conclusions that either (a) an improbable event
has occurred, or (b) that the divergence of fact from theory is significant
of some real effect and cannot be attributed to fluctuations of sampling.
The smaller P is, the more we incline to the latter alternative ; if we do
decide to adopt it, the inferences we draw will depend on the nature of the
problem. Sometimes it will lead us to reject our hypothesis. Sometimes
it will lead us to suspect our sampling technique.

The following examples will illustrate the type of reasoning involved in
applying the χ² test.

Example 20.3. In some experiments on dice-throwing W. F. R. Weldon
rolled 12 dice 26,306 times, observing at each throw the number of dice
recording a 5 or a 6.

If the dice are unbiased, the chance of getting a 5 or a 6 with one die
is 1/3. Hence the chances with 12 dice of getting 12 5's or 6's, 11 5's or 6's,
etc., are the successive terms in the binomial (1/3 + 2/3)^12. Hence the theo-
retical frequencies in 26,306 throws are the terms in 26,306 (1/3 + 2/3)^12.
These are our m's.

The following table shows the actual (m̃) and the theoretical (m)
frequencies, together with the values of x²/m.

TABLE 20.1. 12 dice thrown 26,306 times, a throw of 5 or 6 reckoned a success

Number of      Observed     Theoretical     m̃ - m      x²/m
successes      frequency    frequency        (x)
                 (m̃)          (m)

0                  185          203          - 18       1.596
1                1,149        1,217          - 68       3.800
2                3,265        3,345          - 80       1.913
3                5,475        5,576          -101       1.829
4                6,114        6,273          -159       4.030
5                5,194        5,018          +176       6.173
6                3,067        2,927          +140       6.696
7                1,331        1,254          + 77       4.728
8                  403          392          + 11       0.309
9                  105           87          + 18       3.724
10 and over         18           14          +  4       1.143

Totals          26,306       26,306             0      35.941


Hence χ² = 35.941, and ν = one less than the number of cells = 10.

From the Tables for Statisticians and Biometricians we have, when
ν = 10 (n' = 11),

P = 0.000857 for χ² = 30
P = 0.000017 for χ² = 40

Evidently when χ² = 35.941, P will be extremely small. If we want to
evaluate it exactly we can proceed by the methods given in the Tables.
In fact P = 0.000086.

Alternatively, from Appendix Table 3 we see that when χ² = 23.209
and ν = 10, the value of P is 0.01. Thus P for χ² = 35.941 must be much
less than this value.

We may therefore say that the correspondence between theory and
fact is very poor. The extreme improbability of the observed event
enables us to say with some confidence that the divergence between the
two is significant, and hence that either our sampling technique or our
hypothesis is at fault. Now in this experiment Weldon took particular
care with the dice-throwing, and we may regard it as unlikely that there
was anything seriously wrong with the randomness of the sampling. We
are therefore led to doubt our hypothesis that the dice were unbiased.
Briefly, then, the test suggests that the dice were biased.
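The whole of Example 20.3 can be retraced in a few lines. The sketch below computes the theoretical frequencies from the binomial without rounding them to whole numbers, so its χ² comes out a little below the 35.941 of Table 20.1, which uses rounded frequencies. The tail probability uses the even-ν series for P; the helper names are illustrative.

```python
import math

N_THROWS = 26306
# Observed frequencies for 0, 1, ..., 9 successes; the last cell is "10 and over".
observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 18]

def binomial_term(k, n=12, p=1 / 3):
    """Chance of exactly k successes among n unbiased dice."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Theoretical frequencies, with 10, 11 and 12 successes run together into one cell.
expected = [N_THROWS * binomial_term(k) for k in range(10)]
expected.append(N_THROWS * sum(binomial_term(k) for k in (10, 11, 12)))

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
nu = len(observed) - 1  # one linear constraint: the total frequency

def tail_probability_even(chi2_value, nu_even):
    """P for even nu: exp(-chi2/2) times a finite series in chi2/2."""
    half = chi2_value / 2
    term, series = 1.0, 0.0
    for j in range(nu_even // 2):
        series += term
        term *= half / (j + 1)
    return math.exp(-half) * series

print(f"chi2 = {chi2:.3f}, nu = {nu}, P = {tail_probability_even(chi2, nu):.6f}")
```

Whether the rounded or the unrounded frequencies are used, P is far below the 1 per cent level, so the conclusion of the example is unaffected.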

Example 20.4. The following table shows the result of inoculation
against cholera on a certain tea estate:

TABLE 20.2

                    Not-attacked      Attacked       Total

Inoculated           431 (427.7)       5 (8.3)         436
Not-inoculated       291 (294.3)       9 (5.7)         300

Total                722              14               736

We shall explain the figures in brackets presently. The question on which
we want to throw light is : Is there any significant association between
inoculation and attack ?

To answer this, let us take for our hypothesis H the supposition that
they are independent. If this is so, the expected frequencies, calculated
in the manner of Chapter 2, are those given in brackets. These we take
to be the m's, the m̃'s being the actual frequencies. We then have

$$\chi^2 = (3.3)^2 \left\{ \frac{1}{427.7} + \frac{1}{8.3} + \frac{1}{294.3} + \frac{1}{5.7} \right\} = 3.27$$

and ν = 1.

From Appendix Table 3, for χ² = 2.706, P = 0.10 and for χ² = 3.841,
P = 0.05. For our observed value of 3.27, P lies between 0.05 and 0.10.

Thus if H is true, our data give a result which would be obtained between
5 and 10 times in a hundred trials. This is infrequent, but not very in-
frequent. Moreover, the theoretical frequencies in the "attacked"
column are not very large. We should therefore be unjustified in rejecting
H on this evidence, but we can say that the data lend some colour to the
supposition that H is not correct.

To sum up, the χ² test shows that the data incline us, though not
strongly, to the belief that inoculation and attack are associated.
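For a table with one degree of freedom the value of P can be had from the normal integral alone, since χ is then a normal deviate. The sketch below recomputes Example 20.4 on that basis; the function name `chi2_from_table` is illustrative, and the expected frequencies are taken from the border totals without rounding.

```python
import math

def chi2_from_table(table):
    """chi-square for divergence from independence in a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Table 20.2: rows inoculated / not-inoculated, columns not-attacked / attacked.
cholera = [[431, 5], [291, 9]]
chi2 = chi2_from_table(cholera)

# With nu = 1, P is the area in both tails of the normal curve beyond chi.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}, P = {p_value:.3f}")
```

The value of P so obtained lies between 0.05 and 0.10, in agreement with the bracketing from Appendix Table 3.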

Example 20.5. (Imaginary data.) An investigator into chocolate
consumption divided the United Kingdom into eight areas and took a
random sample from each, the individuals so obtained being classified as
consumers or non-consumers of chocolate. His results were as follows:


TABLE 20.3

Area number         1      2      3      4      5      6      7      8    Total

Consumers          56     87    142     71     88     72    100    142      758
                  (55)   (81)  (152)   (69)   (90)   (72)   (95)  (144)

Non-consumers      17     20     58     20     31     23     25     48      242
                  (18)   (26)   (48)   (22)   (29)   (23)   (30)   (46)

Total              73    107    200     91    119     95    125    190    1,000


Do these results suggest that the consumption of chocolate varies
from place to place ?

Let us take as our hypothesis H the supposition that it does not, i.e.
that the two attributes in the above table are independent. The theo-
retical frequencies are then those shown in brackets, and we have

$$\chi^2 = \frac{1^2}{55} + \frac{6^2}{81} + \text{similar terms} = 6.28$$

The table has two rows and eight columns, and hence ν = (2-1)(8-1) = 7.
From Appendix Table 3 we have for ν = 7, χ² = 6.346, P = 0.50 ; or
alternatively, from the Tables for Statisticians and Biometricians for
ν = 7 (n' = 8),

if χ² = 6, P = 0.539750

if χ² = 7, P = 0.428880

Hence, for χ² = 6.28, P = 0.51 approximately.

Thus there is no cause to suspect our hypothesis, and the data do not
suggest that the proportion of consumers of chocolate varies from place
to place, at least so far as this test is concerned.
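The sum of sixteen terms and the interpolation for P in Example 20.5 can be sketched as follows. The theoretical frequencies are the rounded figures shown in brackets in Table 20.3, and the two tabulated values of P are those quoted from the Tables for Statisticians and Biometricians; the helper `interpolate` is illustrative.

```python
# Table 20.3: observed frequencies and the bracketed theoretical frequencies.
observed = [
    [56, 87, 142, 71, 88, 72, 100, 142],   # consumers
    [17, 20, 58, 20, 31, 23, 25, 48],      # non-consumers
]
theoretical = [
    [55, 81, 152, 69, 90, 72, 95, 144],
    [18, 26, 48, 22, 29, 23, 30, 46],
]

# chi-square as the sum of x^2/m over all sixteen cells.
chi2 = sum(
    (o - m) ** 2 / m
    for obs_row, theo_row in zip(observed, theoretical)
    for o, m in zip(obs_row, theo_row)
)

nu = (len(observed) - 1) * (len(observed[0]) - 1)

def interpolate(x, x0, y0, x1, y1):
    """Simple linear interpolation between two tabulated points."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Tabulated values for nu = 7: P = 0.539750 at chi2 = 6, P = 0.428880 at chi2 = 7.
p_value = interpolate(chi2, 6, 0.539750, 7, 0.428880)

print(f"chi2 = {chi2:.2f}, nu = {nu}, P = {p_value:.2f}")
```

Linear interpolation serves well enough here because P is near the middle of its range; it is much less reliable when P is very small.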

Properties of the χ² distribution

20.17 The curves

$$y = y_0 \, e^{-\chi^2/2} \, \chi^{\nu-1}$$

and the probability function P derived from them, have several interesting
properties which are worth noticing. As χ² is essentially positive, we
consider only positive values of the variate.

(a) In the first place, it will be seen that when ν = 1 the curve is the
normal curve with unit standard deviation, for positive values of the
variate. Thus the test for ν = 1 may be reduced to testing the significance
of deviations of a normally distributed variate.

(b) When ν > 1 the curve is of the single-humped type. It starts from
zero at the origin (χ² = 0), where for ν > 2 it is tangential to the χ-axis,
rises to a maximum where χ² = ν - 1 and then falls more slowly to zero as
χ² increases indefinitely. It is thus skew to the right.

(c) As ν increases, the curve becomes more and more symmetrical. In
fact, when ν is large, √(2χ²) is distributed approximately normally about a
mean √(2ν - 1) with unit standard deviation. This result, due to R. A.
Fisher, enables us to dispense with tables of P for large values of ν, say
ν > 30, and to use the normal integral instead. In practice large values
of ν are rather infrequent.

Example 20.6. To find P when χ² = 64 and ν = 41.

We know that √(2χ²) is distributed normally about mean √(82 - 1) = 9
with unit standard deviation. When χ² = 64, √(2χ²) = 11.314, which
therefore has a deviation 2.314 to the right of the mean. Hence we have
to find the area of the probability curve to the right of the ordinate which
is 2.314 units to the right of the mean. From Appendix Table 2 this is
seen to be 0.0103 approximately.
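Fisher's approximation for large ν is easily put into code, with the area of the normal curve taken from `math.erfc` in place of Appendix Table 2. This is a sketch only, and the function name is illustrative.

```python
import math

def fisher_normal_approximation(chi2, nu):
    """Approximate P for large nu, treating sqrt(2 chi2) as normal about sqrt(2 nu - 1)."""
    deviation = math.sqrt(2 * chi2) - math.sqrt(2 * nu - 1)
    # Area of the unit normal curve to the right of the deviation.
    return 0.5 * math.erfc(deviation / math.sqrt(2))

# Example 20.6: chi2 = 64 with nu = 41.
p_value = fisher_normal_approximation(64, 41)
print(f"P = {p_value:.4f}")
```

The approximation is intended for values of ν above about 30; for smaller ν the tables or the exact series should be used.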

Conditions for the application of the χ² test

20.18 We may conveniently bring together at this point the various
precautions which should be observed in applying the χ² distribution to a
test of significance.

(a) In the first place, N must be reasonably large. Otherwise the x's
are not normally distributed.

This is a condition which is almost always fulfilled in practice. It is 
difficult to say exactly what constitutes largeness, but as an arbitrary 
figure we may say that N should be at least 50, however few the cells. 

(b) No theoretical cell frequency should be small. Here again it is 
hard to say what constitutes smallness, but 5 should be regarded as the 
very minimum, and 10 is better. 

In practice, data not infrequently contain cell frequencies below these 
limits. As a rule the difficulty may be met by amalgamating such cells 
into a single cell. Thus, in Example 20.3 above, the theoretical numbers 
of throws with 10, 11 and 12 successes are (to the nearest integer) 13, 1 
and 0. Instead of putting each into a separate cell we have run them 
together into one cell " 10 and over.” 

(c) The constraints must be linear. The reason for this condition has
not emerged explicitly in the foregoing because we omitted the stage in
the proof of the χ² distribution at which it occurs.

20.19 To these three conditions we may add the following remarks,
which should also be borne in mind when the χ² test is being used.

(a) The χ² test tells us the probability of getting, on a random sample,
a value of χ² equal to or higher than the actual value. If this probability
is small we are justified in suspecting a significant divergence between
theory and experiment.

We cannot proceed, however, in the reverse direction and say that if P
is not small our hypothesis is proved correct. All that we can say is that
the test reveals no grounds for supposing the hypothesis incorrect ; or
alternatively, that so far as the χ² test is concerned, data and hypothesis
are in agreement.

(b) Nor do only small values of P lead us to suspect our hypothesis or
our sampling technique. A value of P very near to unity may also
do so.

This rather surprising result arises in this way : a large value of P
normally corresponds to a small value of χ², that is to say a very close
agreement between theory and fact. Now such agreements are rare,
almost as rare as great divergences.

We are just as unlikely to get very good correspondence between fact 
and theory as we are to get very bad correspondence and, for precisely the 
same reasons, we must suspect our sampling technique if we do. In short, 
very close correspondence is too good to be true. 

The student who feels some hesitation about this statement may like to
reassure himself with the following example. An investigator says that he
threw a die 600 times and got exactly 100 of each number from 1 to 6.
This is the theoretical expectation, χ² = 0 and P = 1, but should we believe
him ? We might, if we knew him very well, but we should probably
regard him as somewhat lucky, which is only another way of saying that
he has brought off a very improbable event.
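How improbable the investigator's result is can be worked out from the multinomial distribution, which gives the chance of any exact set of counts. The sketch below is an illustration added here, not a calculation from the text; it works in logarithms through `math.lgamma` to avoid the enormous factorials, and the function name is illustrative.

```python
import math

def log_multinomial_probability(counts, probabilities):
    """Natural logarithm of the chance of exactly these counts in sum(counts) trials."""
    n_trials = sum(counts)
    log_p = math.lgamma(n_trials + 1)  # log of n!
    for count, prob in zip(counts, probabilities):
        log_p += count * math.log(prob) - math.lgamma(count + 1)
    return log_p

# 600 throws of an unbiased die giving exactly 100 of each face.
counts = [100] * 6
faces = [1 / 6] * 6
p_exact = math.exp(log_multinomial_probability(counts, faces))

print(f"chance of exactly 100 of each face: {p_exact:.2e}")
```

Although this is the single most probable outcome of the 600 throws, its chance is still well under one in a million, which is why so perfect an agreement invites suspicion.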

20.20 At this point we can resume a topic which we laid on one side
in 20.11, namely the signs of the x's, which are ignored by χ².

It may happen that χ² has quite a moderate value and P is not small
when all the positive x's are on one side of the mode of the theoretical
distribution and all the negative x's on the other. There will thus be a
consistent "shift" of the m̃'s one way or the other from the m's. This
may give us a value of the mean quite outside the limits of sampling.
Again, if the x's are all negative in the cells farthest removed from the
mean, the standard deviation may show an almost impossible divergence
from expectation.

Thus, although the χ² test may reveal no cause to suspect the hypothesis,
a closer examination of the x's may.

Example 20.7. Consider the following dice data (Table 20.4) (Weldon,
see Example 19.1.)

TABLE 20.4. 12 dice thrown 4,096 times, a throw of 4, 5 or 6 points reckoned a
success

Number of      Observed     Expected frequency       x        x²/m
successes      frequency    (m)
                 (m̃)        4096 (1/2 + 1/2)^12

0                    0              1               - 1      1.0000
1                    7             12               - 5      2.0833
2                   60             66               - 6      0.5455
3                  198            220               -22      2.2000
4                  430            495               -65      8.5354
5                  731            792               -61      4.6982
6                  948            924               +24      0.6234
7                  847            792               +55      3.8194
8                  536            495               +41      3.3960
9                  257            220               +37      6.2227
10                  71             66               + 5      0.3788
11 and 12           11             13               - 2      0.3077

Totals           4,096          4,096                 0     33.8104 = χ²

Now, in this example, all the x's are negative up to 5 successes, positive
from 6 to 10 successes, and negative again for 11 to 12 successes. This is
almost one of the cases we referred to earlier in this section.

We have, in fact, already found (Example 17.4) that the
mean deviates from the expected value by 5.13 times the standard error.


From the tables we find:

ν     n'    χ²    P

12    13    30    0.002792

12    13    40    0.000072

Hence, by simple interpolation for χ² = 33.8104, P = 0.0018.

As a matter of fact, simple interpolation is of very little value for small values of
P (cf. 24.12), and this value is wide of the mark, the true value being 0.00072. Appendix
Table 3 shows us that P is less than 0.01.

From the extended tables of the normal integral in Tables for Statisticians
and Biometricians, Part I, we have:

Greater fraction of the area of a normal
curve for a deviation 5.13 . . . . . 0.9999998551

Area in the tail of the curve . . . . 0.0000001449

Area in both tails . . . . . . . . . 0.0000002898

so that the probability of getting such a deviation (+ or -) on random
sampling is only about 3 in 10,000,000.

Comparing this with the value of P, we see that the data are really more
divergent from theory than the χ² test would lead us to suppose.
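The two tests of Example 20.7 can be set side by side in a short sketch. The expected frequencies are the binomial coefficients of (1/2 + 1/2)^12, with 11 and 12 successes run together. For the test of the mean the sketch assumes that all 11 throws in that pooled cell showed exactly 11 successes; this assumption is not stated in the table, but it reproduces the deviation of 5.13 standard errors quoted above.

```python
import math

N_DICE = 12
# Observed frequencies for 0, 1, ..., 10 successes; the last cell pools 11 and 12.
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11]
expected = [math.comb(N_DICE, k) for k in range(11)]
expected.append(math.comb(N_DICE, 11) + math.comb(N_DICE, 12))

# The chi-square test ignores the signs of the excesses.
excesses = [o - e for o, e in zip(observed, expected)]
chi2 = sum(x ** 2 / e for x, e in zip(excesses, expected))
signs = "".join("+" if x > 0 else "-" for x in excesses)

# Test of the mean. Assumption: the pooled cell is scored as 11 successes.
n_throws = sum(observed)
mean_successes = sum(k * o for k, o in enumerate(observed)) / n_throws
standard_error = math.sqrt(N_DICE * 0.5 * 0.5 / n_throws)
deviation = (mean_successes - N_DICE * 0.5) / standard_error
both_tails = math.erfc(abs(deviation) / math.sqrt(2))

print(f"chi2 = {chi2:.4f}, signs of the excesses: {signs}")
print(f"mean deviates by {deviation:.2f} standard errors, area in both tails = {both_tails:.2e}")
```

The run of signs makes the systematic shift visible at a glance, and the test of the mean gives a far smaller probability than the χ² test does.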

20.21 Hence, if the signs of the x's show any marked peculiarities,
it is as well to apply as many supplementary tests as are available, and
not to rely on the χ² test alone. Such tests would include those for the
significance of the mean and standard deviation, which we have already
discussed.

Levels of significance

20.22 In the examples we have given above, our judgment whether P
was small enough to justify us in suspecting a significant difference between
fact and theory has been more or less intuitive. Most people would agree,
in Example 20.3, that a probability of only 0.0001 is so small that the
evidence is very much in favour of the supposition that the dice were biased.
But we shall not always get such a decisive result. Suppose we had
obtained P = 0.1, so that the odds against the event are nine to one. Is
this value small enough to lead us to suspect the dice ? If it is not, would
P = 0.01 be small enough ? Where, if anywhere, can we draw the line ?

The odds against the observed event which influence a decision one 
way or the other depend to some extent on the caution of the investigator. 
Some people (not necessarily statisticians) would regard odds of ten to one 
as sufficient. Others would be more conservative and reserve judgment
until the odds were much greater. It is a matter of personal taste.


20.23 There are, however, two values of P which are widely used to
provide a rough line of demarcation between acceptance and rejection of
the significance of observed deviations. These values are P = 0.05 and
P = 0.01, and are said to define 5 per cent and 1 per cent levels of significance.
The value P = 0.001, i.e. the 0.1 per cent level, is also used. A value of
P less than 0.05 will be said to fall below the 5 per cent level of significance,
and so on. The values of the 5 per cent and the 1 per cent levels, among
others, are tabulated in Appendix Table 3.

Example 20.8. Let us consider the data of Exercise 2.11. In experi-
ments on the Spahlinger anti-tuberculosis vaccine the following results were
obtained. (As before, the figures in brackets are the independence values.)

                              Died or seriously    Unaffected or not     Total
                              affected             seriously affected

Inoculated                       6 (8.87)             13 (10.13)           19

Not inoculated or inocu-         8 (5.13)              3 (5.87)            11
lated with control media

Total                           14                    16                   30

Here,

χ² = 4.75 and ν = 1

From Appendix Table 3 we have when ν = 1 for P = 0.05, χ² = 3.841, and
we have for P = 0.01, χ² = 6.635, so that P lies between the 5 per
cent level of significance and the 1 per cent level.
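The comparison of χ² with the tabulated levels can be sketched as below. The two critical values for ν = 1 are those quoted from Appendix Table 3; the expected frequencies are computed from the border totals without rounding, so χ² comes out very slightly below the 4.75 obtained from the rounded independence values. The variable names are illustrative.

```python
# Observed frequencies of Example 20.8.
observed = [[6, 13], [8, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        independence_value = row_totals[i] * col_totals[j] / grand_total
        chi2 += (count - independence_value) ** 2 / independence_value

# Values of chi-square at the 5 and 1 per cent levels for nu = 1 (Appendix Table 3).
critical_values = {0.05: 3.841, 0.01: 6.635}

# A result is significant at a level when chi2 exceeds the tabulated value.
significant_at = [level for level, critical in critical_values.items() if chi2 > critical]

print(f"chi2 = {chi2:.2f}; significant at levels: {significant_at}")
```

Only the 5 per cent level is reached, so the verdict depends on which level is taken as appropriate.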

If, therefore, we take the 5 per cent level as appropriate to this case,
the results are significant ; but if we are more conservative and take the
1 per cent level, the results are not significant. In this particular case
the position is complicated by the relative smallness of the theoretical cell
frequencies.

The additive property of χ²

20.24 It sometimes happens, by the repetition of experiments or other- 
wise, that we have a number of tables for similar data from different 
fields. The values of P for each may not be entirely conclusive. The 
question then arises whether we cannot obtain a value of P for the aggre- 
gate, telling us what is the probability of getting, by random sampling, a 
series of divergences from theory as great as or greater than those observed. 

The question is usually answered by pooling the results to form a single 
table. But, apart from the fact that this is not always possible, we have 
already seen (Chapter 3) that pooling is likely to introduce fallacies. A 
better method is to proceed in accordance with the following general rule. 

20.25 Suppose we have a number of groups of data, each furnishing a
χ² and a ν. Add together all the χ²'s to form a single value χ₁², and all
the ν's to form a single value ν₁. The χ² test may then be applied to χ₁²
and ν₁ as if they came from a single set of cells.

The validity of this rule will be evident when we consider how the χ²
test was arrived at. The variate m̃ in every cell is normally distributed
about a mean m, and χ₁² is the sum of the squares of quantities like
x/√m, just as χ² was. This, together with the linearity of the constraints,
which remains, was the essential part of the proof of the χ² distribution,
and hence the test remains true for χ₁² and ν₁.

Example 20.9. In Example 20.4 (inoculation against cholera on a
certain tea estate) we saw that the $\chi^2$ test, although suggesting that
inoculation had some effect in immunising, did not allow us to place any
great confidence in such a conclusion. The following data give $\chi^2$ and P
for six estates, including the one we have already discussed:

           $\chi^2$        P

            9.34       0.0022
            6.08       0.014
            2.51       0.11
            3.27       0.071
            5.61       0.018
            1.59       0.21

    Total  28.40

Here only one value of P is less than 0.01, and we might be inclined to
doubt whether the association between inoculation and immunity is real.
Let us, however, add the values of $\chi^2$ and of $\nu$. We get $\chi_1^2 = 28.40$ and
$\nu_1 = 6$, there being one degree of freedom from each of the six tables.

From Appendix Table 3 we see that this value is well beyond the one
per cent significance point. If we require greater accuracy, from the
tables we have:

           $\chi^2$        P

            28         0.000094
            29         0.000061

Whence by interpolation P = 0.00008 approximately, i.e. we should expect
to get a $\chi^2$ as great as this only 80 times in a million. We can, therefore,
regard the results, taken together, as significant with a high degree of
confidence.
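
The pooling rule of 20.25 and the arithmetic of Example 20.9 can be checked with a short calculation. The following is only a sketch in Python, which is not part of the original text; the function name `chi2_upper_tail` is an illustrative assumption, and the closed-form tail sum it uses is valid only when the number of degrees of freedom is even, which happens to be the case here ($\nu_1 = 6$).

```python
import math

def chi2_upper_tail(chi2, nu):
    """P(chi-square >= chi2) for an EVEN number of degrees of freedom nu.

    For nu = 2k the tail area has the closed form
        exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if nu <= 0 or nu % 2 != 0:
        raise ValueError("this closed form needs a positive even nu")
    half = chi2 / 2.0
    terms = (half ** j / math.factorial(j) for j in range(nu // 2))
    return math.exp(-half) * sum(terms)

# chi-square values for the six estates of Example 20.9 (one degree of freedom each)
estate_chi2 = [9.34, 6.08, 2.51, 3.27, 5.61, 1.59]

chi2_total = sum(estate_chi2)   # additive property: add the chi-squares ...
nu_total = len(estate_chi2)     # ... and add the degrees of freedom
p_total = chi2_upper_tail(chi2_total, nu_total)

print(f"chi2 = {chi2_total:.2f}, nu = {nu_total}, P = {p_total:.6f}")
```

Evaluating the tail area directly in this way takes the place of the interpolation between the tabulated values at 28 and 29.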

Estimation of theoretical frequencies from the data

20.26 Our theoretical frequencies m may be calculated partly on the
basis of information from the data, partly on a priori grounds. Thus,
in the dice-throwing data of Example 20.3, our hypothesis that the dice
were unbiased enabled us to say that the chance of getting a 5 or a 6 was
1/3, and hence that the chances with 12 dice were the terms in
$26{,}306\,(\tfrac{2}{3} + \tfrac{1}{3})^{12}$.
Here we take only the value of N, the total frequency, from the data.

In the association and contingency tables, the values of row and 
column totals, as well as N, are taken from the data and we assume 
a priori that the attributes are independent. 

It may be, however, that we draw further information from the data 
themselves in fixing the theoretical frequencies. In such cases an im- 
portant modification is necessary in the previous methods of work, for the 
number of degrees of freedom is further restricted by each piece of 
information drawn from the data, as we have already seen for contingency 
tables. 

20.27 Consider, for example, the dice-throwing data of Example 20.3.
We have already seen that the dice were probably biased, so that the
chance of a success was not 1/3. What, then, was it?

To answer this question we can only appeal to the data. The proportion
of 5's and 6's in the total number of throws of individual dice
(26,306 × 12) was 0.3377. Let us therefore take this to be an estimate of
the true probability. We can be confident that it will be somewhere
very close, owing to the large number in the sample. The theoretical
frequencies will then be the terms in $26{,}306\,(0.6623 + 0.3377)^{12}$.
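
As a rough illustration of what "the terms in" this binomial expansion are, the sketch below (Python, not part of the original text; the variable names are illustrative assumptions) expands $N(q + p)^{12}$ term by term, using the estimate p = 0.3377 quoted above.

```python
from math import comb

N_THROWS = 26306   # number of throws of the 12 dice (Example 20.3)
N_DICE = 12
p = 0.3377         # proportion of 5's and 6's found in the data
q = 1 - p

# Terms of N (q + p)^12: theoretical number of throws showing k successes
theoretical = [N_THROWS * comb(N_DICE, k) * p**k * q**(N_DICE - k)
               for k in range(N_DICE + 1)]

for k, freq in enumerate(theoretical):
    print(f"{k:2d} successes: {freq:9.1f}")
```

The thirteen terms necessarily add up to N, since $(q + p)^{12} = 1$.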

To take a second case: consider the height distribution of Table 4.7,
page 82. We have already had reason to suspect that this is a sample
from a normal population. If we suppose this hypothesis to be correct,
the question arises: What is the mean and standard deviation of the
population? Here again we must estimate these quantities from the data,
in the manner of Chapter 18.

20.28 We shall denote values of the theoretical frequencies which are
calculated from parameters estimated from the data by the letter m', and
the value of $\chi^2$ calculated from them by $\chi'^2$, so that we have:

$$\chi'^2 = \sum \frac{(\tilde{m} - m')^2}{m'}$$

where $\tilde{m}$ denotes an observed frequency.

Now, $\chi'^2$ is an estimate of $\chi^2$ and, if the m''s are close to the m's, $\chi'^2$ will
be close to $\chi^2$. $\chi'^2$ is made up of two parts, one measuring the divergence
between theory and fact, the other due to errors of estimation of m. If
the second is small compared with the first, we may expect that the $\chi^2$
test, applied with $\chi'^2$ instead of the unknown $\chi^2$, will continue to reveal
significant differences between theory and fact where such exist.

20.29 The question as to the precise conditions under which the $\chi^2$
test is applicable for such cases has not been completely answered, but
it has been shown that, if the cell frequencies are large, the test still
applies subject to the following conditions:

(a) The number of degrees of freedom must be reduced by unity for 
each constant of the population which is estimated from the data. 

(b) The estimates must be of the type known as "efficient."

We shall not be able in this Introduction to go into the theory of this
important class of estimate, but it will be sufficient if we indicate that the
estimates of the mean of a normal population, and the parameter m of the
Poisson distribution, are "efficient" if calculated in the ordinary way,
i.e. by taking the value of the parameter in the sample to be the value of
the parameter in the population.

Example 20.10. Reverting to the data of Example 20.3, let us estimate
the true chance of getting a 5 or a 6 from the data themselves. The
frequency of the successful event is 0.3377 of the whole. This is an
"efficient" estimate of the chance. The following table gives the
observed frequencies and the theoretical frequencies calculated from the
formula $26{,}306\,(0.6623 + 0.3377)^{12}$:

TABLE 20.5. 12 dice thrown 26,306 times, a throw of 5 or 6 reckoned a success

  Number of      Observed       Theoretical
  successes      frequency      frequency       $\tilde{m} - m'$     $(\tilde{m} - m')^2 / m'$
                 ($\tilde{m}$)           (m')

      0             185            187             -2           0.021
      1           1,149          1,146              3           0.008
      2           3,265          3,215             50           0.778
      3           5,475          5,465             10           0.018
      4           6,114          6,269           -155           3.832
      5           5,194          5,115             79           1.220
      6           3,067          3,043             24           0.189
      7           1,331          1,330              1           0.001
      8             403            424            -21           1.040
      9             105             96              9           0.844
  10 and over        18             16              2           0.250

    Total        26,306         26,306              0           8.201
Thus $\chi'^2 = 8.201$. There are 11 cells, with one linear constraint. We
have also fitted one constant from the data, and hence we must take
$\nu = 9$. From Appendix Table 3 we then see that P is very close to 0.50. Thus
our hypothesis is now, so far as the $\chi^2$ test is concerned, in agreement
with experiment.
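
The arithmetic of Example 20.10 can be retraced as follows. This is a sketch in Python, which is not part of the original text; the frequencies are copied from Table 20.5 and the variable names are illustrative assumptions.

```python
# Observed and theoretical frequencies of Table 20.5 (0, 1, ..., 9, "10 and over")
observed    = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 18]
theoretical = [187, 1146, 3215, 5465, 6269, 5115, 3043, 1330, 424,  96, 16]

# chi'^2: sum of (observed - theoretical)^2 / theoretical over the cells
chi2_prime = sum((o - e) ** 2 / e for o, e in zip(observed, theoretical))

# Degrees of freedom: cells, less one for the fixed total N,
# less one for the chance p estimated from the data (condition (a) of 20.29)
cells = len(observed)
linear_constraints = 1
fitted_parameters = 1
nu = cells - linear_constraints - fitted_parameters

print(f"chi'^2 = {chi2_prime:.3f} on nu = {nu} degrees of freedom")
```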

Experiments on the $\chi^2$ distribution

20.30 Several statisticians have conducted experiments to verify the
theory which we have discussed in the foregoing sections. A certain
amount of work in this field remains to be done, but generally it may
be said that experiment supports the theory. So far as cases where the
m's are calculated a priori are concerned there is little doubt of its
correctness.

In one set of experiments (by Yule) 200 beans were thrown into a
revolving circular tray with 16 equal radial compartments and the number
of beans falling into each compartment was counted. The 16 frequencies
so obtained were arranged (1) in a 4 × 4 table, and (2) in a 2 × 8
table, and $\chi^2$ was calculated from the independence frequencies, as in
Example 20.5.

The experiment and the calculations were repeated 100 times. The
following table exhibits the actual and the theoretical distribution of $\chi^2$:


TABLE 20.6. Theoretical distribution of $\chi^2$ calculated from independence values, in
tables with 16 compartments, compared with the actual distributions given by 100
experimental tables

In the first case $\nu$ must be taken as 9, in the second as 7

                  4 Rows, 4 Columns             2 Rows, 8 Columns
   $\chi^2$       Expectation   Observation     Expectation   Observation

   0-5           16.6           17              34.0          29.5
   5-10          48.4           44              47.1          56.5
  10-15          26.0           32              15.3          10
  15-20           7.3            6               3.0           3
  20-             1.8            1               0.6           1

  Total         100.1          100             100.0         100


In a second experiment with 2x2 tables 350 experimental tables of 
100 observations each were available. Table 20.7 shows the actual and 
theoretical distributions in this case. 
TABLE 20.7. Theoretical distribution of $\chi^2$ for a table with 2 Rows and 2 Columns,
when $\chi^2$ is calculated from the independence values, compared with the actual results
for 350 experimental tables

                         Number of tables
  Value of $\chi^2$      Expected     Observed

  0    - 0.25          134.02        122
  0.25 - 0.50           48.15         54
  0.50 - 0.75           32.56         41
  0.75 - 1.00           24.21         24
  1    - 2              56.00         62
  2    - 3              25.91         18
  3    - 4              13.22         13
  4    - 5               7.05          6
  5    - 6               3.86          5
  6    -                 5.01          5

  Total                349.99        350


It is interesting to see what happens if we apply the $\chi^2$ test to these
tables.

In Table 20.6, grouping together the frequencies from $\chi^2 = 15$ upwards,
so that $\nu = 3$, $\chi^2$ is found to be 2.27 for the 4 × 4 tables and 4.36 for the
2 × 8 tables, giving P = 0.52 in the first case and 0.22 in the second.

In Table 20.7, $\chi^2 = 7.53$, $\nu = 9$, P = 0.58.
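
These three values of $\chi^2$ can be recomputed from the tables themselves. The sketch below is in Python, which is not part of the original text; the helper names `chi2_statistic` and `pool_tail` are illustrative assumptions. It pools the last two classes of Table 20.6 exactly as described above.

```python
def chi2_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def pool_tail(values, keep):
    """Reduce to `keep` cells by merging the last cells into one."""
    return values[:keep - 1] + [sum(values[keep - 1:])]

# Table 20.6: classes 0-5, 5-10, 10-15, 15-20, 20-
exp_4x4, obs_4x4 = [16.6, 48.4, 26.0, 7.3, 1.8], [17, 44, 32, 6, 1]
exp_2x8, obs_2x8 = [34.0, 47.1, 15.3, 3.0, 0.6], [29.5, 56.5, 10, 3, 1]

# Group the frequencies from chi-square = 15 upwards: 4 cells, so nu = 3
chi2_4x4 = chi2_statistic(pool_tail(obs_4x4, 4), pool_tail(exp_4x4, 4))
chi2_2x8 = chi2_statistic(pool_tail(obs_2x8, 4), pool_tail(exp_2x8, 4))

# Table 20.7: ten classes, so nu = 9
exp_2x2 = [134.02, 48.15, 32.56, 24.21, 56.00, 25.91, 13.22, 7.05, 3.86, 5.01]
obs_2x2 = [122, 54, 41, 24, 62, 18, 13, 6, 5, 5]
chi2_2x2 = chi2_statistic(obs_2x2, exp_2x2)

print(f"4x4: {chi2_4x4:.2f}   2x8: {chi2_2x8:.2f}   2x2: {chi2_2x2:.2f}")
```

Small differences in the last decimal place from the figures quoted in the text are to be expected, since the tabulated expectations are themselves rounded.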


Goodness of fit 

20.31 The $\chi^2$ distribution, as we have seen, leads to tests of the
correspondence between theory and fact, and this and other reasons have
led to its being described as a test of the "goodness of fit." This
expression may be used in two ways. In the first place, it may describe the
"fit" of observed and hypothetical data. In the second, it may be used
without reference to a hypothesis merely to provide an objective means
of estimating the merits of a particular formula or a particular curve in
graduating a set of values or a series of points.

The arithmetic in the second class of cases is exactly the same as in
the first. Conventionally, we regard very low values of P as denoting
a poor fit, and moderate values as denoting a reasonably good fit. High
values show an excellent fit, and in considering them we take no heed of
the point discussed in 20.19 (b), since we are assessing the closeness of
the curve to the data, not the probability that the first represents a
population from which the second was derived by random sampling.
SUMMARY

1. The value of $\chi^2$ is defined by

$$\chi^2 = \sum \frac{(\tilde{m} - m)^2}{m}$$

where $\tilde{m}$ refers to the observed and m to the theoretical frequencies.

2. The number of degrees of freedom of an aggregate of cells is denoted
by $\nu$, and is equal to the number of cells whose frequencies can be
determined at will. When $\nu$ cell frequencies are determined, the remainder are
calculable directly from the conditions to which the cell frequencies are
subjected by the nature of the data.

3. The frequency-distribution of $\chi^2$ is given by

$$dF = y_0\, e^{-\chi^2/2}\, \chi^{\nu - 1}\, d\chi$$

4. From this it is possible to ascertain the probability P that on
random sampling we should get a value of $\chi^2$ as great as or greater than
a given value. Tables have been constructed for this purpose.

5. The $\chi^2$ distribution may be applied to data grouped in cells provided
(a) that the total number N in the sample is large, (b) that no theoretical
cell frequency is small, and (c) that the constraints are linear.

6. The value of P for any given case enables us to judge of the corre- 
spondence between hypothesis and data. 

7. When the theoretical cell frequencies have to be calculated from
parameters estimated from the data, the $\chi^2$ test can be applied with

$$\chi'^2 = \sum \frac{(\tilde{m} - m')^2}{m'}$$

instead of $\chi^2$, provided that the cell frequencies are large, the estimates
are "efficient," and the number of degrees of freedom used in ascertaining
P is reduced by unity for every parameter which is estimated.

8. The value of P can also be used to give an objective criterion of the
"goodness of fit" of a curve to a set of points or of a formula to a set of
values.

EXERCISES 

20.1 The following table (Weldon) gives the results of a dice-throwing
experiment:

12 dice thrown 4,096 times, a throw of 6 reckoned a success

  Number of successes     0     1     2     3     4     5     6   7 and over   Total

  Frequency             447  1145  1181   796   380   115    24       8        4096

Find $\chi^2$ on the hypothesis that the dice were unbiased and hence show
that the data are consistent with this hypothesis so far as the $\chi^2$ test
is concerned.

20.2 Perform an experiment by throwing a die 600 times and noting the 
number of points at each throw. Use these data to inquire whether the 
die is biased. 

20.3 200 digits were chosen at random from a set of tables. The
frequencies of the digits were:

  Digit        0   1   2   3   4   5   6   7   8   9   Total

  Frequency   18  19  23  21  16  25  22  20  21  15    200

Use the $\chi^2$ test to assess the correctness of the hypothesis that the digits
were distributed in equal numbers in the tables from which these were
chosen.

20.4 Perform an experiment on the lines of Exercise 20.3 by taking, say,
the last figure in 200 logarithms taken from a set of five-figure logarithm
tables.

20.5 (Data: Yule, Jour. Anthrop. Inst. 1906, 36, 325) Sixteen pieces
of photographic paper were printed down to different depths of colour
from nearly white to a very deep blackish brown. Small scraps were
cut from each sheet and pasted on cards, two scraps on each card one above
the other, combining scraps from the several sheets in all possible ways,
so that there were 256 cards in the pack. Twenty observers then went
through the pack independently, each one naming each tint either "light,"
"medium" or "dark."

The following table shows the name assigned to each of the two pieces
of paper:

  Name assigned          Name assigned to upper tint
  to lower tint        Light      Medium      Dark        Total

  Light                 850        571         580        2001
  Medium                618        593         455        1666
  Dark                  540        456         457        1453

  Total                2008       1620        1492        5120



Show that there is a significant association between the name assigned 
to one piece and the name assigned to the other. 

20.6 Apply the $\chi^2$ test to Example 2.8, page 29, and examine
the justification for the conclusions there drawn.
20.7 Show that, if $\nu$ is large, P is below the 5 per cent level of significance
if

$$\sqrt{2\chi^2} - \sqrt{2\nu - 1} > 1.65$$

and below the 1 per cent level of significance if

$$\sqrt{2\chi^2} - \sqrt{2\nu - 1} > 2.33$$


20.8 Table 3.6, page 64, gives the number of criminals of normal and
weak intellect for various ranges of weight.

Assuming this to be a random sample of criminals, do the data support
the suggestion that weak-minded criminals are not underweight?

20.9 Show that in a 2 × 2 contingency table wherein the frequencies are

    a | b
    c | d

$\chi^2$ calculated from the "independence" frequencies is

$$\chi^2 = \frac{(a + b + c + d)(ad - bc)^2}{(a + b)(c + d)(b + d)(a + c)}$$


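20.10 Show similarly that for a 2 × n table

$$\chi^2 = N_1 N_2 \sum_{r=1}^{n} \frac{1}{f_{1r} + f_{2r}} \left( \frac{f_{1r}}{N_1} - \frac{f_{2r}}{N_2} \right)^2$$

where $f_{1r}$, $f_{2r}$ are the 2 frequencies in the rth column and $N_1$, $N_2$ are the
marginal sums of the 2 rows.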

20.11 Two investigators draw samples from the same town in order to
estimate the number of persons falling in the income groups "poorer,"
"middle class," "well to do." (The limits of the groups are defined in
terms of money and are the same for both investigators.) Their results
are as follows:

                               Income group
  Investigator     "Poorer"    "Middle Class"    "Well to do"     Totals

       A              140           100               15            255
       B              140            50               20            210

     Totals           280           150               35            465



Show that the sampling technique of at least one of the investigators is 
suspect. 
20.12 Exercise 8.17 gives the number of deaths per day of women over
85 published in The Times during 1910-12. Using the theoretical
frequencies obtained in that exercise on the hypothesis that the numbers
are distributed in a Poisson series, employ the $\chi^2$ test to estimate the
correctness of this hypothesis.

20.13 Design and execute an experiment involving the $\chi^2$ test to test
the randomness of a set of random sampling numbers.

20.14 (Data: G. Mendel's classical paper on "Experiments in
Plant-Hybridisation," quoted in translation in W. Bateson's "Mendel's
Principles of Heredity")

In experiments on pea-breeding, Mendel obtained the following fre- 
quencies of seeds: 315 round and yellow; 101 wrinkled and yellow; 
108 round and green ; 32 wrinkled and green. Total 556. 

Theory predicts that the frequencies should be in the proportions 
9 : 3 : 3 : 1. 

Examine the correspondence between theory and experiment. 

20.15 A particular experiment gives, on hypothesis H, $\chi^2 = 9$, $\nu = 8$;
when repeated it gives the same result. Show that the two results taken 
together do not give the same confidence in H as either taken separately. 

20.16 (Data from the Registrar-General’s Statistical Review for England 
and Wales, 1941, Tables, Part II, Civil). The following figures show the 
number of births in England and Wales in 1941 by month of occurrence — 


  January       50,159        July          49,395
  February      45,885        August        50,443
  March         50,819        September     51,562
  April         49,070        October       50,224
  May           50,771        November      47,168
  June          46,788        December      50,529

                              Total        592,813


Use the $\chi^2$ test to discuss whether there is any seasonality in birth revealed
by these data.



CHAPTER TWENTY-ONE 


THE SAMPLING OF VARIABLES 

SMALL SAMPLES 


The problem

21.1 We now proceed to examine the theory of samples which are not
large enough to warrant the assumptions underlying the work of Chapters
17 to 19. In particular, it will no longer be open to us to assume (a)
that the random sampling distribution of a statistic is approximately
normal, or even unimodal, or (b) that values given by the data are
sufficiently close to the population values for us to be able to use them in
gauging the precision of our estimates.

The removal of these assumptions imposes severe restriction on our 
work, and, as we shall see, an entirely new technique is necessary to deal 
with the problems for which they are not permissible. The division 
between the theories of large and small samples is therefore a very real 
one, though it is not always easy to draw a precise line of demarcation. 
We should point out, however, that as a rule the methods of the theory 
of small samples are applicable to large samples, though the reverse is 
not true. 

Estimates 

21.2 In the theory of large samples we were able to take as an estimate 
of a parameter in a population the value calculated from the sample as if 
it were itself the population. This procedure, obvious though it seems, 
is not in general valid for small samples. We must therefore discuss
briefly the basis on which estimates of given parameters are to be made. 

A full investigation of this question would take us far beyond the limits 
of this book. It involves matters of considerable mathematical and 
philosophical complexity, some of which still form the subject of dispute 
among statisticians. But in the theory of small samples the main para- 
meters of interest are the mean and the standard deviation (or the 
variance), and we will proceed to consider these two. 

Estimates of the arithmetic mean

21.3 We shall take as the estimate of the arithmetic mean the value
of the sample mean. That is to say, if we have n sample values $x_1$,
$x_2$, . . . $x_n$, our estimate $\bar{x}$ of the mean in the population is

$$\bar{x} = \frac{1}{n}\sum(x) \qquad (21.1)$$

For estimates of the mean, therefore, the practice is the same for small
samples as for large.

It may be shown that for samples from a normal population an estimate
obtained in this way is the "best" in the sense that its sampling variance
is less than that of any other estimate of the mean.

Estimates of the variance 

21.4 Let us denote the variance in the population by $\sigma^2$ and the mean
by m.

If m is known, we take as an estimate of the variance the mean square
deviation of the sample about m; i.e. the estimate, which we write as $s^2$,
is given by

$$s^2 = \frac{1}{n}\sum(x - m)^2 \qquad (21.2)$$

In general, however, we do not know the value of m, which will itself
have to be estimated. In this case equation (21.2) is no longer applicable.

21.5 If m is the population mean and $\bar{x}$ is the sample mean, we have:

$$\begin{aligned}
\sum(x - m)^2 &= \sum(x - \bar{x} + \bar{x} - m)^2 \\
              &= \sum(x - \bar{x})^2 + \sum(\bar{x} - m)^2 \\
              &= \sum(x - \bar{x})^2 + n(\bar{x} - m)^2
\end{aligned}$$

the cross-product term $2(\bar{x} - m)\sum(x - \bar{x})$ vanishing because $\sum(x - \bar{x}) = 0$.

Hence,

$$s^2 = \frac{1}{n}\sum(x - \bar{x})^2 + (\bar{x} - m)^2$$

The term $\frac{1}{n}\sum(x - \bar{x})^2$ is the variance of the sample. We see that
it differs from $s^2$ by the term $(\bar{x} - m)^2$.

Now this term will not, in general, vanish; nor will it vanish on the
average in a large number of cases, for it is essentially positive. Hence,
if we take the variance of the sample to be an estimate of the variance
of the population we shall involve ourselves in a systematic error of
magnitude $(\bar{x} - m)^2$.

This term is the square of the deviation of the mean of the sample
from the mean of the population, and its average value in a large number of
samples is the variance of the mean, which we know to be equal to $\sigma^2/n$.

It seems reasonable, therefore, instead of ignoring the presence of the
term $(\bar{x} - m)^2$, to take it as equal to $\sigma^2/n$. We will attempt, on this basis,
a new estimate, which we shall write $s'^2$. We have then:

$$s'^2 = \frac{1}{n}\sum(x - \bar{x})^2 + \frac{\sigma^2}{n}$$
The value of $\sigma$ is unknown, but we may, as an approximation, write $s'^2$
for $\sigma^2$. If we do so and solve for $s'^2$ we get:

$$s'^2 = \frac{1}{n - 1}\sum(x - \bar{x})^2 \qquad (21.3)$$

The effect of taking $s'^2$ given by equation (21.3), instead of the variance
of the sample, will thus be to eliminate the systematic error of estimation
to which we have just referred.

21.6 We may look at this in a slightly different way. Suppose we
take a large number of estimates of the variance of a population compiled
according to equation (21.2), m being assumed known. These estimates
will fall into a distribution which is the sampling distribution of the
variance in samples of n. If, as will usually be the case, it is of the
unimodal type, we expect it to have a mean located at the true value of
the variance in the population.

Now if we take as estimates of the variance the variance of the samples
(each about its own sample mean), the above will not be true, owing to
the small systematic shift represented by the term $(\bar{x} - m)^2$; but it will
be true of the estimates given by equation (21.3), and this is therefore
a preferable estimate to take.
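
The systematic shift can be exhibited exactly on a small scale. The sketch below is in Python, which is not part of the original text; the parent population (the six faces of a fair die) and the sample size n = 3 are assumptions chosen purely for illustration. It enumerates every possible sample and averages the two estimates with exact fractions, so no sampling error enters the comparison.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical parent: one fair die
n = 3                             # sample size

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

m = mean(population)
sigma2 = mean((x - m) ** 2 for x in population)   # variance of the parent

def sum_of_squares(sample):
    """Sum of squared deviations of a sample about its own mean."""
    x_bar = mean(sample)
    return sum((x - x_bar) ** 2 for x in sample)

# All 6**3 samples of n drawn with replacement are equally likely
samples = list(product(population, repeat=n))

avg_sample_variance = mean(sum_of_squares(s) / n for s in samples)       # divisor n
avg_corrected = mean(sum_of_squares(s) / (n - 1) for s in samples)       # divisor n - 1

print("population variance        :", sigma2)
print("average with divisor n     :", avg_sample_variance)
print("average with divisor n - 1 :", avg_corrected)
```

With the divisor n the average falls short of the population variance by exactly the factor (n − 1)/n, whereas the divisor n − 1 of equation (21.3) restores the balance on the average.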


21.7 Equation (21.3) was obtained by reasoning which does not depend
on the size of n, and strictly speaking we should take it as applicable
also to large samples. But if n is large, n and n − 1 are for all practical
purposes equal. With such samples our results are true only within the
range of the standard error, which is usually of order $1/\sqrt{n}$, and there is
little point in straining after an illusory refinement by taking n − 1 instead
of n in calculating the variance.

From a similar point of view it might be thought that since the term
$\sigma^2/n$ is generally less than the square of the standard error of the variance,
it is equally idle to make allowance for it in estimating the variance.
This would be true if the term were zero on the average; but in fact it
is not, being a biased error, and we are justified in the long run in allowing
for it.

Furthermore, we may point out that the use of $s'^2$, the corrected
value obtained by allowing for the term $\sigma^2/n$, is only valid on the average.
If, on random sampling, we get a sample variance greater than the
population variance, the correction only makes matters worse, and may even
lead to an absurd result.


Degrees of freedom of an estimate 

21.8 In discussing the $\chi^2$ test we introduced the notion of number of
degrees of freedom, being the number of cells in an aggregate whose
frequency could be assigned at will. We may conveniently extend this
nomenclature to estimates of parameters and particularly of variance.

We shall refer to the divisor in the estimates of equations (21.1), (21.2)
and (21.3) as the number of degrees of freedom of the estimates, and
shall write it as $\nu$. Thus, $\nu$ in equation (21.2) is n, and in equation (21.3)
is n − 1.

That this convention conforms to that adopted for the $\chi^2$ test may
easily be seen. We saw that $\nu$ is the number of cells, that is, the number
of terms contributing to the sum, less one for each constraint and one
for each parameter which had been estimated from the data. In the
quantity $\sum(x - m)^2$ there are n independent contributions of the type
$(x - m)^2$, and hence we may say that n is the number of degrees of freedom
of that estimate; but in the quantity $\sum(x - \bar{x})^2$ we have used the data to
estimate $\bar{x}$, and hence the number of degrees of freedom is lowered by
unity, i.e. equals n − 1.

Test of significance 

21.9 It cannot be over-emphasised that estimates from small samples 
are of little value in indicating the true value of the parameter which is 
estimated. Some estimates will be better than others, but no estimate is 
very reliable. In the present state of our knowledge this is particularly 
true of samples from populations which are suspected not to be normal. 

Nevertheless, circumstances sometimes drive us to base inferences, 
however tentatively, on scanty data. In such cases we can rarely, if ever, 
make any confident attempt at locating the value of a parameter within 
serviceably narrow limits. For this reason we are usually concerned, in 
the theory of small samples, not with estimating the actual value of a 
parameter, but in ascertaining whether observed values can have arisen 
by sampling fluctuations from some value given in advance. For example, 
if a sample of ten gives a correlation coefficient of +0.1, we shall inquire,
not the value of the correlation in the parent population, but, more 
generally, whether this value can have arisen from an uncorrelated 
population, i.e. whether it is significant of correlation in the parent. 

21.10 The remainder of this chapter will accordingly be devoted to a 
brief discussion of various tests of significance. Within this book we 
shall not have space to deal with these tests as fully as we should like ; but 
our account of sampling methods would be incomplete without some
reference to sundry results of great intrinsic interest and importance in
the field of small samples. 

The assumption of normality

21.11 We have already considered one test of significance, that given
by the distribution of $\chi^2$. This is one of the simplest and most general
tests known; but the student will recall that it depends on the assumption
that the theoretical distribution of cell frequencies in each cell is normal.
This is justified under the conditions laid down in 20.18.

In the tests which we shall now discuss we are similarly compelled to 
make some assumption about the nature of the parent population, although 
we shall no longer be able to lay down analogous conditions on the arrange- 
ment of the data under which the assumption is justified. We shall 
specifically assume that the parent population is normal unless otherwise 
stated. 

21.12 Our results will, therefore, be strictly true only for the normal 
population. Some experiments have been made to throw light on the
question whether they are true for other types of population. It appears
that, provided the divergence of the parent from normality is not too
great, the results which are given below as true for normal populations are 
true to a large extent for other populations. Theoretical work confirms 
that the results remain true for populations which do not deviate 
markedly from normality ; but if there is any good reason to suspect that 
the parent is markedly skew, e.g. U- or J-shaped, the methods of the 
succeeding sections cannot be applied with much confidence. 

21.13 We may direct attention to one further point on which caution 
is necessary. In the theory of large samples we recommended the student
to base his conclusions on a range of six times the standard error, and
pointed out that for normal populations the probability of deviations from
the true value outside this range was less than 3 in 1,000. One can feel
great confidence in conclusions supported by probabilities of this order.
But in the theory of small samples it is, as a rule, necessary to use larger
probabilities, say, of one in 20 or one in 100, e.g. the 1 per cent and 5 per
cent levels of P in the $\chi^2$ test. The force of inferences based on
probabilities of this order is not so great as before, and the student should bear
this fact in mind.

21.14 For a known parent population, and in particular for a normal 
parent, it is not difficult to find expressions for the random sampling 
distribution of the commoner statistics such as the mean and standard 
deviation. But these distributions, even when mathematically tractable, 
will in general contain certain parent values. For instance, the sampling 
distribution of the means of samples of n from a normal population with 
mean m and standard deviation $\sigma$ is also normal with mean m and standard
deviation $\sigma/\sqrt{n}$. In the cases which we wish to consider, n is not large
enough for us to take estimates of m and $\sigma$ from the sample to find the
sampling distribution to any close degree of approximation.

It is, however, a remarkable fact that we can construct certain statistics
whose sampling distributions are either independent of, or dependent
on only one of, the constants of the parent. We will proceed to consider
two important distributions of this kind, the so-called t-distribution, due
to "Student," and the z-distribution, due to R. A. Fisher.
The t-distribution

21.15 Writing, as before,

$$\bar{x} = \frac{1}{n}\sum(x)$$

let us define a new statistic t by the equation

$$t = \frac{\bar{x} - m}{s'}\sqrt{\nu + 1} \qquad (21.4)$$

where $\nu = n - 1$, s' is given by equation (21.3), and m is the mean of the population.

We shall refer to $\nu$ as the number of degrees of freedom of t.

Then it may be shown that, for samples of n from a normal population,
the distribution of t is given by

$$y = \frac{y_0}{\left(1 + \dfrac{t^2}{\nu}\right)^{(\nu + 1)/2}} \qquad (21.5)$$

21.16 We will imagine $y_0$ chosen so that the area of the curve given
by equation (21.5) is unity. Then, precisely as for the $\chi^2$ distribution,
the probability $P_S$ that, on random sampling, we shall get a value of t not
greater than some value $t_0$ is the area of the curve to the left of the ordinate
at the point $t_0$. We may write this

$$P_S = \int_{-\infty}^{t_0} \frac{y_0\, dt}{\left(1 + \dfrac{t^2}{\nu}\right)^{(\nu + 1)/2}} \qquad (21.6)$$

Similarly, the probability that we get a value of t between the limits
$t_0$ and $t_1$ is given by

$$\int_{t_0}^{t_1} \frac{y_0\, dt}{\left(1 + \dfrac{t^2}{\nu}\right)^{(\nu + 1)/2}} \qquad (21.7)$$

Form of "Student's" distribution

21.17 The curves given by equation (21.5) are easy to study. Clearly
they are symmetrical about t = 0, since only even powers of t appear in their
equation. Further, since $\left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}$ decreases as t increases
in absolute value, the curves
will have a mode (coinciding, of course, with the mean) at t = 0, and will
tail off to infinity on each side. They will, in fact, be symmetrical
single-humped curves rather like the normal curve, only more leptokurtic.

As $\nu$ tends to infinity, $\left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}$ tends to $e^{-t^2/2}$ and hence t is
distributed normally. This fact enables us to use the tables of the normal
integral to evaluate P approximately when $\nu$ is large.
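
The approach to the normal curve can be seen numerically. The sketch below is in Python, which is not part of the original text. It assumes the standard value of the constant, $y_0 = \Gamma\!\left(\tfrac{\nu+1}{2}\right) / \left(\sqrt{\nu\pi}\,\Gamma\!\left(\tfrac{\nu}{2}\right)\right)$, which makes the area under the curve unity; the text itself does not write $y_0$ out, and the function names are illustrative.

```python
import math

def t_density(t, nu):
    """Ordinate of equation (21.5), with y0 chosen so that the total area is 1.

    y0 = Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)); logarithms of the
    gamma function are used so that large nu does not overflow.
    """
    log_y0 = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
              - 0.5 * math.log(nu * math.pi))
    return math.exp(log_y0 - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_density(t):
    """Ordinate of the normal curve with zero mean and unit standard deviation."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

# Ordinates at the centre and in the tail for increasing degrees of freedom
for nu in (1, 5, 30, 1000):
    print(f"nu = {nu:5d}: y(0) = {t_density(0.0, nu):.4f}, y(3) = {t_density(3.0, nu):.4f}")
print(f"normal    : y(0) = {normal_density(0.0):.4f}, y(3) = {normal_density(3.0):.4f}")
```

For small $\nu$ the ordinate in the tail is markedly greater than the normal ordinate, which is the leptokurtosis referred to above; as $\nu$ grows the two sets of ordinates converge.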

21.18 At the end of this book we reproduce by permission tables of
the integral (21.6) calculated by "Student" himself (Appendix Table 4).
These have been reduced to three places of decimals from the original four.
Tables of rather a different form have been given in the Fisher-Yates
Statistical Tables and in Tables for Statisticians and Biometricians, Part I,
and to avoid possible confusion we point out where these tables differ.
Tables for Statisticians, etc., gives the values of

$$\int_{-\infty}^{z} \frac{y_0\, dz}{(1 + z^2)^{(\nu + 1)/2}}$$

where $z = t/\sqrt{\nu}$, for $\nu$ from 1 to 9. These values (which were also
calculated by "Student") are of the same kind as, but more limited in range
than, those of our table.

The Fisher-Yates tables adopt the standpoint we have already noticed 
in discussing the distribution (Chapter 20), and gives values of i 
corresponding to various values of p and the 5 per cent and 1 per cent 
levels of a third probability 

Pg and Pjf, are simply related. Pg is the probability that an observed 
value will not exceed P^ is the probability that an observed value of t, 
regardless of sign, will exceed t^. 

Hence, 

Pg = Area of curve to the left of ordinate 
P, == Area to right of t^ + area to left of —t^ 

= 2 (Area to right of ifo) (since the curve is symmetrical) 

= 2(1~P,) (21.8) 

The student should keep these relations in mind, particularly when
thinking of levels of significance. In the sense of Fisher and Yates, a
value of t will fall below the 5 per cent level if $P_F$ is less than 0.05.
This implies that $P_S$ is greater than 0.975, not 0.95.¹
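The conversion between the two kinds of probability is easily mechanised. The sketch below merely restates relation (21.8) and its inverse; the function names are our own and are not taken from either set of tables.

```python
def two_sided_from_one_sided(p_s):
    """P_F = 2(1 - P_S): chance that |t| exceeds t0, given the area to the left of t0."""
    return 2 * (1 - p_s)

def one_sided_from_two_sided(p_f):
    """Inverse relation: P_S = 1 - P_F / 2."""
    return 1 - p_f / 2

# Levels quoted in the Fisher-Yates sense, expressed in "Student's" sense.
for p_f in (0.05, 0.01):
    print(p_f, one_sided_from_two_sided(p_f))
```

Thus a table entered at the 5 per cent level in the Fisher-Yates sense corresponds to the 0.975 entry of a table of $P_S$, and the 1 per cent level to 0.995.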

¹ A comparison of the tables is not made any easier by the fact that “Student”
and Fisher use n to denote the degrees of freedom, whereas Tables for Statisticians uses
it to denote the number in the sample. It is probable that future editions of Tables
for Statisticians will give more complete tables for the percentage points of t.

The distinction between $P_S$ and $P_F$ did not arise in Chapter 20 because $\chi^2$ is essentially
positive.

Applications of “Student’s” distribution

21.19 We proceed to give one or two examples of the way in which
the “Student” distribution is generally used to test the significance of
various results obtained from small samples.

Example 21.1.— Ten individuals are chosen at random from a population
and their heights are found to be, in inches, 63, 63, 66, 67, 68, 69,
70, 70, 71 and 71. In the light of these data, discuss the suggestion
that the mean height in the population is 66 inches.

In the first place, let us note that the population is likely to be
approximately normal, from our knowledge of height distributions, and the
sampling is random.

In the sample we find that

\[ \bar{x} = 67.8 \text{ inches} \]

and

\[ s' = 3.011 \text{ inches} \]

Let us now calculate t from equation (21.4), taking m to be 66 inches.
We have—

\[ t = \frac{67.8-66}{3.011}\sqrt{10} = 1.89 \]

From Appendix Table 4 (column $\nu$ = 9)—

for t = 1.8, P = 0.947
for t = 1.9, P = 0.955

Hence,

for t = 1.89, P = 0.954


Thus the chance of getting a value of t greater than that observed is
1 - 0.954, i.e. 0.046, or about one in twenty. The probability of getting t
greater in absolute value is 0.092, or about one in ten. We should hardly
regard this as significant; but if we did, we should argue that as the
observed value of t is improbable, the initial assumptions on which we
obtained it were incorrect; and this in turn suggests that there is some
doubt about the true mean being 66 inches.
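The arithmetic of Example 21.1 may be retraced as follows. This is only a sketch: instead of interpolating in Appendix Table 4 it integrates the t-curve numerically by Simpson's rule, assuming the usual constant $y_0 = \Gamma\{(\nu+1)/2\}/\{\sqrt{\nu\pi}\,\Gamma(\nu/2)\}$; the helper `student_cdf` is our own name, not a library routine.

```python
import math

heights = [63, 63, 66, 67, 68, 69, 70, 70, 71, 71]
m = 66  # suggested population mean under test

n = len(heights)
mean = sum(heights) / n
# Estimated standard deviation s', with n - 1 degrees of freedom.
s_prime = math.sqrt(sum((x - mean) ** 2 for x in heights) / (n - 1))
t = (mean - m) / s_prime * math.sqrt(n)

def student_cdf(t, nu, steps=2000):
    """Area of the t-curve to the left of t, by Simpson's rule on [0, |t|]."""
    y0 = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    f = lambda u: y0 * (1 + u * u / nu) ** (-(nu + 1) / 2)
    h = abs(t) / steps                      # steps must be even
    total = f(0.0) + f(abs(t))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(k * h)
    half = total * h / 3                    # area between 0 and |t|
    return 0.5 + half if t >= 0 else 0.5 - half

P = student_cdf(t, n - 1)
print(round(mean, 1), round(s_prime, 3), round(t, 2), round(P, 3))
print("one tail:", round(1 - P, 3), " both tails:", round(2 * (1 - P), 3))
```

The integration agrees with the tabular interpolation to about three decimal places, which is as much as the three-figure table can be expected to give.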

Example 21.2.— (Voelcker’s data quoted by “Student,” Biometrika,
1908, 6, 19.)

Voelcker grew certain crops of potatoes dressed (a) with sulphate of
potash, and (b) with kainite. In four experiments, two in each of 1904
and 1905, the differences in yields per acre (sulphate plot less kainite
plot) were—

0.5464 ton

This suggests that sulphate of potash is a better manure than kainite.
Required to discuss the question.

From our knowledge of crop yields we expect them to be distributed
in a unimodal form not very far removed from the normal. Let us
suppose that the two manures have the same effect on yield. Then the
differences of plots will be distributed in an approximately normal form
about zero mean.

The mean of the four differences is 0.7626 ton, and we find s' = 0.5312.
Hence

\[ t = \frac{0.7626-0}{0.5312}\sqrt{4} = 2.871 \]

From the tables, for $\nu$ = 3, P = 0.968 approximately.

Hence the chance of getting a value of t greater than that observed
is about 1 in 33. The chance of getting a value greater absolutely than
the observed value is 0.06. If we choose to regard this as significant,
we are led to suspect our hypothesis that the two manures exert equal
influences on yield, and hence to suppose, though with little confidence
so far as these data are concerned, that sulphate of potash is the better
manure.


21.20 The student who wishes to apply the t-distribution for himself
is advised to make a careful study of the logic of the argument underlying
the inferences we have drawn in the foregoing two examples.

In Example 21.1 we saw that the chance of getting a value of t less
than 1.89 is approximately 0.954. This is not the same thing as saying
that the probability of a deviation in the sample mean of 1.8 inches or
less is 0.954. In fact, we do not know this probability, and the smallness
of the sample prevents us from approximating to it with any closeness.
It might happen that $\sigma$ in the population was such that a deviation of
1.8 inches was not at all improbable. The relative improbability of t
would then be due to deviations of s' from $\sigma$.

Comparison of two samples

21.21 Suppose we have two samples $x_1, x_2, \ldots, x_{n_1}$ and $x'_1, x'_2, \ldots, x'_{n_2}$.

Let us, as before, define

\[ \bar{x}_1 = \frac{1}{n_1}\Sigma(x), \qquad \bar{x}_2 = \frac{1}{n_2}\Sigma(x') \tag{21.9} \]

Let us further define

\[ s'^2 = \frac{\Sigma(x-\bar{x}_1)^2+\Sigma(x'-\bar{x}_2)^2}{n_1+n_2-2} \tag{21.10} \]

If the two samples come from the same population, $s'^2$ will be an estimate
of $\sigma^2$. It has, as we might expect, $n_1+n_2-2$ degrees of freedom, since
both $\bar{x}_1$ and $\bar{x}_2$ are calculated from the data.

Let us write

\[ \nu = n_1+n_2-2 \tag{21.11} \]

and define

\[ t = \frac{\bar{x}_1-\bar{x}_2}{s'\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} = \frac{\bar{x}_1-\bar{x}_2}{s'}\sqrt{\frac{n_1 n_2}{n_1+n_2}} \tag{21.12} \]

Then it may be shown that t, as so defined, is distributed according to
the form of equation (21.5) with $\nu$ degrees of freedom.

Example 21.3. — (Data from R. A. Fisher, Metron, 1925, 5, 95.)

Eight pots growing three barley plants each were exposed to a high
tension discharge, while nine similar pots were enclosed in an earthed wire
cage. The numbers of tillers in each pot were as follows —

Caged . . . . . 17, 27, 18, 25, 27, 29, 27, 23, 17
Electrified . . 16, 16, 20, 16, 20, 17, 15, 21

We are interested in the question whether electrification exercises any
real effect on the tillering.

We find

\[ \bar{x}_1 = 23.333, \qquad \bar{x}_2 = 17.625, \qquad \bar{x}_1-\bar{x}_2 = 5.708 \]

\[ s'^2 = \tfrac{1}{15}\times 221.875 = 14.7916, \qquad s' = 3.846 \]

\[ t = \frac{5.708}{3.846}\sqrt{\frac{8\times 9}{17}} = 3.05 \]

\[ \nu = 8+9-2 = 15 \]


From the tables we find that $P_S$ = 0.996.

Hence, if the samples came from the same population they furnish a
value of t which is improbable— an absolutely greater value would arise
only 8 times in a thousand. We therefore suspect that the populations
are different, i.e. that electrification does exert some effect on the tillering.

21.22 In applying the t-distribution to two samples as in the preceding
example one further point should be borne in mind. It does not follow
from a significant value of t that the samples come from populations which
have different means. Samples from two populations with the same
means and different standard deviations would also furnish significant
t's on occasion. We can test whether this is so by the method of 21.27
below.
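The arithmetic of Example 21.3 can be retraced in a few lines. The sketch below pools the two deviances over $n_1+n_2-2$ degrees of freedom and then forms t as in (21.12); the variable and function names are ours.

```python
import math

caged = [17, 27, 18, 25, 27, 29, 27, 23, 17]
electrified = [16, 16, 20, 16, 20, 17, 15, 21]

def mean(xs):
    return sum(xs) / len(xs)

def deviance(xs):
    """Sum of squares of deviations from the sample mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(caged), len(electrified)
nu = n1 + n2 - 2                                        # (21.11)
pooled = deviance(caged) + deviance(electrified)        # 221.875 in the text
s_prime = math.sqrt(pooled / nu)                        # pooled estimate s'
t = (mean(caged) - mean(electrified)) / s_prime * math.sqrt(n1 * n2 / (n1 + n2))  # (21.12)

print(nu, round(pooled, 3), round(s_prime, 3), round(t, 2))
```

The same few lines serve for any two samples; only the two lists need be changed. They do not, of course, test whether the two populations have the same standard deviation, which is the point raised in 21.22.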

Significance of regression coefficients 

21.23 From (21.4) it is clear that “Student’s” t is a ratio, being, apart
from constants, the ratio of the estimate of the sample mean (measured
from the parent mean) to the estimated standard deviation. The
simplicity of its sampling distribution (21.5) arises from the fact (which
we state without proof) that in normal samples, and only in normal samples,
sampling variations of the mean are completely independent of (and not
merely uncorrelated with) those of the variance.

There are other cases in which we find a quantity which is the ratio
of two independent variates, the numerator distributed like a mean and
the denominator like a standard deviation in normal samples. In such
cases, of course, the ratio t follows “Student’s” distribution. The most
important, perhaps, is that of regression coefficients.

21.24 Consider a linear regression equation —

\[ y = \beta x \tag{21.13} \]

where y, x are measured from their means and $\beta$ is the parent value of the
regression. We will assume that for any fixed x the distribution of y is
normal as, for instance, is true if the joint distribution is normal. The
corresponding sample regression equation will be —

\[ y-\bar{y} = b(x-\bar{x}) \tag{21.14} \]

Then if $s_1^2$, $s_2^2$ are the sample variances of x and y respectively it may
be shown that —

\[ t = \frac{(b-\beta)\,s_1\sqrt{n-2}}{\sqrt{s_2^2-b^2 s_1^2}} \tag{21.15} \]

is distributed in “Student’s” form with $\nu = n-2$ degrees of freedom. The
result derives from the fact that $(b-\beta)s_1$ is distributed like a mean in
normal samples whereas $(s_2^2-b^2 s_1^2)/(n-2)$ is distributed independently
like a variance. It is, in fact, an estimate of the variance of the residuals
of observed values about the regression line — cf. 9.24.
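A sketch of the calculation may help. It assumes that (21.15) has the form implied by the description of its numerator and denominator, namely $t = (b-\beta)s_1\sqrt{n-2}/\sqrt{s_2^2-b^2s_1^2}$, with $s_1^2$ and $s_2^2$ the sample variances (divisor n). The data are invented purely for illustration and the function name is ours.

```python
import math

def regression_t(xs, ys, beta=0.0):
    """Return (b, t, nu) for the sample regression of y on x tested against beta."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n           # s1^2
    var_y = sum((y - my) ** 2 for y in ys) / n           # s2^2
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    b = cov / var_x                                      # sample regression coefficient
    t = ((b - beta) * math.sqrt(var_x) * math.sqrt(n - 2)
         / math.sqrt(var_y - b * b * var_x))
    return b, t, n - 2

# Invented data, not taken from the text.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
b, t, nu = regression_t(xs, ys)
print(round(b, 3), round(t, 2), nu)
```

When the parent value tested is zero, the same t may be written $r\sqrt{n-2}/\sqrt{1-r^2}$, r being the sample correlation between x and y; this gives a convenient check on the arithmetic.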

The expression for t in (21.15) does not involve any of the parent
parameters except $\beta$ and consequently it may be used to test the significance
of b irrespective of the other parameters.

Example 21.4. — In Table 13.1 (page 311) we gave some data for the
yields of wheat and potatoes in 48 English counties. The regression of
y (potato yield) on x (wheat yield) is found to be —

\[ y-6.065 = 0.0783\,(x-15.791) \]

The value of the regression coefficient is small. Could it have arisen by
chance in a sample from a population for which $\beta$ = 0?

We find—

\[ b = 0.0783, \qquad \sqrt{n-2} = \sqrt{46} = 6.7823, \]

\[ s_1^2 = 4.1749, \qquad s_2^2 = 0.5340. \]

Hence, from (21.15) —

\[ t = 2.06, \qquad \nu = 46 \]

Appendix Table 4 does not carry us as far as $\nu$ = 46. For large $\nu$, t tends
to be distributed normally with zero mean and unit variance and a normal
deviate of 2.06 would be significant at the 5 per cent level but not at
the 2 per cent level. The regression is of doubtful significance.

More accurately, from the Fisher-Yates Tables we find the following
values of t for P = 0.05 —

$\nu$ = 40, t = 2.021;   $\nu$ = 60, t = 2.000

and for P = 0.02 —

$\nu$ = 40, t = 2.423;   $\nu$ = 60, t = 2.390

This confirms our result that the observed t is significant at the 5 per
cent but not at the 2 per cent level.

21.25 We have remarked in 19.31 that the significance of a value of
Spearman’s rank correlation-coefficient can be tested by the use of
“Student’s” distribution; and we shall see later (21.34) that the
product-moment correlation can also be tested in the same way on the hypothesis
that there is no parent correlation. These facts are to be regarded as
mathematical accidents. They do not depend on the properties of
“Student’s” t as a ratio, but on the fact that the t-distribution, being
a symmetrical unimodal distribution which tends to normality, may be
used as an approximation to other distributions of the same kind.

Fisher’s distribution 

21.26 Suppose that we have two samples, as in 21.21, with variances
$s_1^2$ and $s_2^2$. Then if the samples come from the same normal population
the distribution of the ratio may be shown to be —

                                                            (21.16)

where the ratio may have any value from 0 to $\infty$. This may be put in a rather
different form. In terms of the estimated variances $s_1'^2$ and $s_2'^2$, write —

\[ z = \log_e\frac{s_1'}{s_2'} = \tfrac{1}{2}\log_e\frac{s_1'^2}{s_2'^2} \tag{21.17} \]

Then it may be shown that in normal samples from the same population
z is distributed according to the law

\[ y = y_0\,\frac{e^{\nu_1 z}}{(\nu_1 e^{2z}+\nu_2)^{\frac{\nu_1+\nu_2}{2}}} \tag{21.18} \]

where

\[ \nu_1 = n_1-1, \qquad \nu_2 = n_2-1 \tag{21.19} \]

As usual, we take $y_0$ so that the area of the curve is unity, and the
probability that we get a given value $z_0$ or greater on random sampling
will be given by the area to the right of the ordinate at $z_0$.


21.27 This probability is not easy to tabulate owing to the fact that
it depends upon the two numbers $\nu_1$ and $\nu_2$. Fisher has therefore
prepared tables showing the 5 per cent and 1 per cent significance points of z,
and a further table of the 0.1 per cent points has been given by Colcord and
Deming. These tables are reproduced by permission in Appendix Tables
6A, 6B and 6C. For practical purposes they are sufficient to enable the
significance of an observed value of z to be gauged. If the exact value of
the probability of obtaining a given value of z or greater is required, use
may sometimes be made of the tables of the incomplete beta-function.
Tables are also available for the values of the variance ratio itself
corresponding to specified probability levels. The quantity z of (21.17)
was used by Fisher instead of the ratio because linear interpolation
is more accurate in the z-tables. The 5 per cent, 1 per cent
and 0.1 per cent points of the variance-ratio F are given in Appendix
Table 5.


Example 21.4. — Consider again the data of Example 21.3.

Here, as always, it is convenient to take the suffix 1 to refer to the
larger of the two estimates of variance.

We have —

\[ s_1'^2 = \frac{184}{8} = 23, \qquad s_2'^2 = \frac{37.875}{7} = 5.4107 \]

\[ z = \tfrac{1}{2}\log_e\frac{23}{5.4107} = 0.724 \]



From Appendix Table 6A we see that for these degrees of freedom the
5 per cent significance value of z is 0.6576. From Table 6B the 1 per cent
value is 0.9614.

The observed z lies between these two and is thus of rather doubtful
significance.

Alternatively $F = s_1'^2/s_2'^2 = 4.25$ and from Appendix Tables 5A and 5B
we see that the 5 per cent and 1 per cent points are 3.73 and 6.84, leading
to the same conclusion.
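The two estimates of variance, their ratio F and the corresponding z may be obtained from the tiller counts of Example 21.3 as in the following sketch. The significance points quoted in the comments are those cited in the text from the Appendix Tables; they are not computed here, and the names used are ours.

```python
import math

caged = [17, 27, 18, 25, 27, 29, 27, 23, 17]
electrified = [16, 16, 20, 16, 20, 17, 15, 21]

def estimated_variance(xs):
    """Deviance about the sample mean divided by its degrees of freedom, n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

estimates = [(estimated_variance(caged), len(caged) - 1),
             (estimated_variance(electrified), len(electrified) - 1)]
# Suffix 1 refers to the larger estimate, so that F >= 1 and z >= 0.
(v1, nu1), (v2, nu2) = sorted(estimates, reverse=True)

F = v1 / v2
z = 0.5 * math.log(F)          # z = (1/2) log_e (s1'^2 / s2'^2)

print(round(v1, 4), round(v2, 4), nu1, nu2)
print(round(F, 2), round(z, 3))
# Points quoted in the text: z = 0.6576 (5 per cent), 0.9614 (1 per cent);
# F = 3.73 (5 per cent), 6.84 (1 per cent).
print(0.6576 < z < 0.9614, 3.73 < F < 6.84)
```

Since $F = e^{2z}$, the two tests are merely different ways of entering the same comparison, and must always lead to the same conclusion.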


21.28 We shall consider this distribution and some of its uses in the 
next chapter (Analysis of Variance). At this stage we may note that, 
since it contains no unknown parameters, it provides a significance test 
for the ratio of any two independent variates each of which is distributed 
like a variance in normal samples. The distribution of a variance (or
equivalently, of course, of a standard deviation) is, in fact, that of $\chi^2$, so
that the distribution of z may be regarded as that of the logarithm of the
ratio of two independent variates each of which is distributed as $\chi^2$.

Correlation coefficient in small samples 

21.29 Although the distribution of the correlation coefficient in samples 
from a bivariate normal population tends to the normal form as the 
size of the sample increases, a fact which justifies the use of the standard 
error for large n, the distribution diverges very remarkably from the
normal when n is small, and even when n is moderately large if the correla- 
tion in the parent population is high. Further investigation is therefore 
necessary before we can assess the significance of correlation coefficients 
obtained from small samples. 

21.30 The distribution of the correlation coefficient in samples from
a bivariate normal population was obtained in an exact form by R. A.
Fisher in 1915. Ordinates of the frequency-curves which give the
distribution have been worked out for various values of n and $\rho$, the
correlation in the population, and are tabulated in F. N. David’s Tables
of the Correlation Coefficient. The general form of these curves is illustrated
in fig. 21.1, which shows the curves for $\rho$ = +0.6 and various values of n.

A glance at this figure will show that even for a moderate value of $\rho$,
such as +0.6, the distribution of the coefficient is U-shaped for n = 3,
and, although unimodal, distinctly skew to the eye even for n = 20. For high
values of $\rho$, such as +0.9, the distribution is skew for higher values of n.

As a result it is safe to say that the values of correlation coefficients
calculated from samples of less than five will throw no light on the existence
of correlation in the population. For samples of 20 or 30 we cannot
apply the standard error with much confidence if the correlation in the
population is likely to be very high, whether positive or negative. 50
seems to be the minimum number in the sample for the application of
the standard error if $\rho$ is very high, and 100 is safer.

21.31 The equation giving the distribution of the correlation coefficient
is very complex, but Miss David's tables referred to above give the areas
under the frequency curves for various values of n, $\rho$ and r. These tables
may be used to assess the significance of an observed value of r from a
bivariate normal population. For most practical purposes, however, use
may be made of a method due to R. A. Fisher, the essence of which is the
transformation of the distribution of r into a new distribution which is
approximately normal.



Fig. 21.1. — Frequency distribution of the correlation coefficient in samples from a
normal population with correlation +0.6 for various values of the number in the
sample, n. In each case the total frequency, i.e. the area under the curve, is unity.

21.32 Before we discuss this process, however, it is desirable to point 
out the degree of applicability of our results. 

(1) In the first place, it has been shown that the distribution of partial
correlation coefficients in samples of n is of the same form as that of total
correlation coefficients in samples of n - p, where p is the number of
secondary subscripts in the partial coefficient.

(2) Secondly, our results are strictly true only for normal populations.
There is some experimental evidence to show that they are true for all
practical purposes even if the parent is moderately skew but remains of
the unimodal type; but if there is any reason to suppose that the parent
is J- or U-shaped according to one or more variates, the student should
draw his conclusions with the utmost reserve.


Fisher’s transformation

21.33 If r and $\rho$ are the correlations in the sample and the population
respectively, let us put

\[ z = \tfrac{1}{2}\log_e\frac{1+r}{1-r}, \qquad \zeta = \tfrac{1}{2}\log_e\frac{1+\rho}{1-\rho} \]

so that

\[ r = \tanh z, \qquad \rho = \tanh\zeta \tag{21.20} \]

Then it may be shown that z is, to a close approximation, distributed
normally about mean $\zeta$ with standard deviation $1/\sqrt{n-3}$.
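In fact, the mean of z is given by

\[ \bar{z} = \zeta + \frac{\rho}{2(n-1)} + \ldots \tag{21.21} \]

and, for the z-distribution

\[ \beta_1 = \frac{\rho^6}{(n-1)^3} + \ldots \tag{21.22} \]

\[ \beta_2 = 3 + \frac{2}{n-1} + \text{terms of higher order} \tag{21.23} \]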

For n = 11, say, $\beta_1$ is of the order of 0.001 even if $\rho$ is high, which shows
how closely the z-distribution lies to the symmetrical; and $\beta_2-3$ is of the
order of 0.2, which shows that the distribution has nearly normal kurtosis.
In such a case the mean of z would differ from $\zeta$ by about 0.05, which is
not large, but might be important in some cases. The standard error of z
is, however, $1/\sqrt{n-3}$, and the factor $\rho/2(n-1)$ may, as a rule, be
neglected in comparison.

This is the basis of the statement above that z is normally distributed
about mean $\zeta$.

We now give some examples of the use of the z-transformation in 
testing the significance of an observed r. 


* This z is to be distinguished from the z of Fisher’s distribution of 21.26.

Example 21.5. — In Example 9.1, page 223, we found that the correlation
between the price indices of animal feeding-stuffs and home-grown oats
is 0.68, the sample consisting of 60 members.

This sample is large enough for us to use the standard error. If we do
so we get

\[ \sigma_r = \frac{1-(0.68)^2}{\sqrt{60}} = 0.07 \text{ approximately} \]

The correlation thus is undoubtedly significant.

We might, alternatively, use the z test, thus, to answer the question,
“Could the observed value have arisen from an uncorrelated population?”
On this hypothesis

\[ \rho = 0 \quad\text{and}\quad \zeta = 0 \]

We have —

\[ z = \tfrac{1}{2}\log_e\frac{1.68}{0.32} = 0.829 \]

The standard error of z is

\[ \frac{1}{\sqrt{57}} = 0.13. \]

The deviation of z from $\zeta$ is more than six times this, and we conclude
that our hypothesis was incorrect, i.e. that the population is correlated.

Example 21.6. — Continuing the previous example, could the observed
correlation have arisen from a population in which $\rho$ = +0.8?

Here

\[ \zeta = \tfrac{1}{2}\log_e\frac{1.8}{0.2} = 1.099 \]

The deviation of z from $\zeta$ is, therefore,

\[ 1.099-0.829 = 0.270 \]

This is about twice the standard error of z. It might arise, though
rarely, as a sampling fluctuation, and we conclude that $\rho$ is likely to be less
than +0.8.
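The working of Examples 21.5 and 21.6 may be sketched as follows, taking the standard error of z as $1/\sqrt{n-3}$ and neglecting the small correction to the mean. The function names are ours.

```python
import math

def fisher_z(r):
    """z = (1/2) log_e((1 + r)/(1 - r)), i.e. the inverse hyperbolic tangent of r."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_deviation(r, rho, n):
    """Deviation of z from zeta, expressed in units of the standard error of z."""
    se = 1 / math.sqrt(n - 3)
    return (fisher_z(r) - fisher_z(rho)) / se

r, n = 0.68, 60
print(round(fisher_z(r), 3), round(1 / math.sqrt(n - 3), 2))
print(round(z_deviation(r, 0.0, n), 1))   # Example 21.5: rho = 0
print(round(z_deviation(r, 0.8, n), 1))   # Example 21.6: rho = +0.8
```

The first ratio exceeds six; the second is about two in absolute value, and negative because the observed z lies below the hypothetical $\zeta$.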

Example 21.7. — In Example 12.1, page 290, we found a partial correlation
of -0.73 (38 unions) between earnings of agricultural labourers and
the percentage of the population in receipt of relief, when the ratio of
numbers in receipt of outdoor relief to those relieved in the workhouse was
constant. Is this significant, and can it have arisen from a population in
which the real correlation is -0.667?

Here

\[ z = \tfrac{1}{2}\log_e\frac{0.27}{1.73} = -0.929 \]

$\zeta$ for an uncorrelated population = 0

\[ \zeta \text{ if } \rho = -0.667 \;=\; \tfrac{1}{2}\log_e\frac{0.333}{1.667} = -0.805 \]

There is one secondary subscript in the partial correlation. Hence, the
standard error of z = $1/\sqrt{38-1-3}$ = 0.1715.

If $\zeta$ = 0, the deviation is more than five times the standard error and
is undoubtedly significant. If $\rho$ = -0.667, the deviation is less than the
standard error and hence may very well have arisen from sampling
fluctuations.
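For a partial coefficient the only change in the working is the reduction of n by the number of secondary subscripts. A sketch for Example 21.7 follows; `partial_z_test` is our own illustrative name.

```python
import math

def partial_z_test(r, rho, n, secondary):
    """Return (deviation of z from zeta in standard errors, standard error of z)
    for a partial correlation with the given number of secondary subscripts."""
    se = 1 / math.sqrt(n - secondary - 3)      # n is reduced by the secondary subscripts
    return (math.atanh(r) - math.atanh(rho)) / se, se

for rho in (0.0, -0.667):
    ratio, se = partial_z_test(-0.73, rho, 38, 1)
    print(rho, round(se, 4), round(ratio, 2))
```

With `secondary` set to zero the function reduces to the test for a total correlation coefficient.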

Application of “Student’s” distribution to correlation coefficients

21.34 The test we have just given is of general application, but it is
worth noticing that if $\rho$ = 0, the significance of a correlation coefficient
in small samples from a normal population may be tested by the “Student”
distribution.

In fact, the distribution of the correlation coefficient assumes a
particularly simple form for such uncorrelated populations, namely,

\[ y = y_0\,(1-r^2)^{\frac{n-4}{2}} \tag{21.24} \]

If we put

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \tag{21.25} \]

then it may be shown that t is distributed in the “Student” form with
n - 2 degrees of freedom, and its significance may be tested accordingly.
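The transformation is easily carried out in either direction. The sketch below assumes that (21.25) is the usual form $t = r\sqrt{n-2}/\sqrt{1-r^2}$; the values of r and n are invented for illustration and the function names are ours.

```python
import math

def t_from_r(r, n):
    """t corresponding to an observed r in a sample of n, with its degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r), n - 2

def r_from_t(t, nu):
    """Inverse relation: the r which corresponds to a given t on nu degrees of freedom."""
    return t / math.sqrt(t * t + nu)

t, nu = t_from_r(0.5, 18)        # invented values: r = 0.5 in a sample of 18
print(round(t, 3), nu)
print(round(r_from_t(t, nu), 3))
```

The inverse relation is convenient when a tabulated significance point of t is to be converted into the corresponding value of r.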


SUMMARY 

1. As an estimate of the mean of the population we may take the mean
of the sample, whether large or small.

2. If the mean of the population is known, we may take the mean
square deviation about that mean as an estimate of the variance of the
population; i.e. the estimate is given by

\[ s^2 = \frac{1}{n}\Sigma(x-m)^2 \]

3. If the mean of the population is not known, a preferable estimate of
the population variance is the “corrected” variance of the sample, given by

\[ s'^2 = \frac{1}{n-1}\Sigma(x-\bar{x})^2 \]

4. This estimate is said to have n - 1 degrees of freedom.

5. In samples from a normal population the parameter t, given by

\[ t = \frac{(\bar{x}-m)\sqrt{n}}{s'} \]

where $\nu = n-1$, is distributed according to the law (due to “Student”)

\[ y = \frac{y_0}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}} \]

This distribution may be used to give the probability of getting a value
of t between specified limits on random sampling.

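6. With two samples, $x_1, \ldots, x_{n_1}$ and $x'_1, \ldots, x'_{n_2}$, from the same
normal population, the parameter t defined by

\[ t = \frac{\bar{x}_1-\bar{x}_2}{s'}\sqrt{\frac{n_1 n_2}{n_1+n_2}} \]

where

\[ s'^2 = \frac{\Sigma(x-\bar{x}_1)^2+\Sigma(x'-\bar{x}_2)^2}{n_1+n_2-2} \quad\text{and}\quad \nu = n_1+n_2-2 \]

is also distributed according to the above law, with $\nu$ degrees of freedom.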

7. With two samples, as before, with estimated variances

\[ s_1'^2 = \frac{\Sigma(x-\bar{x}_1)^2}{n_1-1}, \qquad s_2'^2 = \frac{\Sigma(x'-\bar{x}_2)^2}{n_2-1} \]

the parameter

\[ z = \log_e\frac{s_1'}{s_2'} = \tfrac{1}{2}\log_e\frac{s_1'^2}{s_2'^2} \]

is distributed according to the law (due to R. A. Fisher)

\[ y = y_0\,\frac{e^{\nu_1 z}}{(\nu_1 e^{2z}+\nu_2)^{\frac{\nu_1+\nu_2}{2}}} \]

where

\[ \nu_1 = n_1-1, \qquad \nu_2 = n_2-1 \]

As usual, this distribution may be used to give the probability of
getting a value of z between specified limits on random sampling.
Alternatively tables are available for testing directly the ratio —

\[ F = s_1'^2/s_2'^2 = e^{2z} \]

8. The distribution of the correlation coefficient in samples from a
normal bivariate population is not normal. However, putting

\[ z = \tfrac{1}{2}\log_e\frac{1+r}{1-r}, \qquad \zeta = \tfrac{1}{2}\log_e\frac{1+\rho}{1-\rho} \]

where $\rho$ is the correlation in the population, it may be shown that z is
approximately normally distributed about $\zeta$ with standard deviation

\[ \frac{1}{\sqrt{n-3}} \]

n being the number in the sample.


9. This result remains true of partial correlation coefficients, but in 
the above formulae n must be taken to be the number in the sample less 
the number of secondary subscripts in the coefficient tested. 

10. In samples from an uncorrelated normal population the distribution
of r is given by

\[ y = y_0\,(1-r^2)^{\frac{n-4}{2}} \]

The statistic t, defined by

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

is distributed in the “Student” form in such cases with n - 2 degrees of
freedom.


EXERCISES 

21.1 Find “Student’s” t for the following variate values in a sample of
10: -6, -4, -3, -2, -2, 0, 1, 1, 3, 5, taking m to be zero, and find from
the tables the probability of getting a value of t as great or greater on
random sampling from a normal population.

21.2 A farmer grows crops on two fields, A and B. On A he puts £1
worth of manure per acre and on B £2 worth. The net returns per acre,
exclusive of the cost of manure, on the two fields in five years are —

Year    Field A, £ per acre    Field B, £ per acre
 1             17                     18
 2             14                     16.5
 3             21                     24
 4             18.5                   19
 5             22                     25

Other things being equal, discuss the question whether it is likely to
pay the farmer to continue the more expensive dressing. State clearly
the assumptions which you make.

21.3 The heights of six randomly chosen sailors are, in inches : 63, 65, 
68, 69, 71 and 72. Those of ten randomly chosen soldiers are : 61, 62, 
65, 66, 69, 69, 70, 71, 72 and 73. Discuss the light that these data throw 
on the suggestion that soldiers are, on the average, taller than sailors. 


21.4 In the data of Exercise 21.3, use the z-distribution to discuss
whether the samples can have come from populations which are identical
so far as height distribution is concerned.

21.5 In three samples of 50 lines each from Shakespeare's “Romeo and
Juliet” (an early play), the following numbers of weak endings were
observed: 7, 9, 10. In three similar samples from “Cymbeline” (late),
the numbers of weak endings were 15, 11, 12. Discuss the suggestion
that Shakespeare's prosody, as judged by the number of weak endings,
changed with advancing years.

21.6 A random sample of 15 from a normal population gives a correlation
coefficient of -0.5. Is this significant of the existence of correlation in
the population?

21.7 Show that in samples of four from an uncorrelated normal popula- 
tion all values of the correlation coefficient are equally probable ; and that 
for samples of less than four a zero coefficient is the most improbable. 


21.8 What is the probability that a correlation coefficient of +0.75 or
less can arise in a sample of 30 from a normal population in which the
true correlation is +0.9? Compare this with the result given by assuming
the sampling distribution normal with standard deviation $(1-r^2)/\sqrt{n}$.


21.9 Test the significance of the partial correlation coefficients of 
Example 12.1, page 290. 

21.10 Show that in samples of 25 from an uncorrelated normal population
the chance is 1 in 100 that r is greater than about 0.43.

21.11 If two statistics both have the same dimensions show that their
ratio must be independent of the scale of the parent population. Hence
consider why “Student's” t and Fisher's z (variance-ratio) are independent
of $\sigma$, the standard deviation of the normal parent.

21.12 By considerations similar to those of the previous exercise show
that in normal samples the distribution of the correlation coefficient
cannot contain either the parent means or the parent variances, but only
the parent correlation.

THE ANALYSIS OF VARIANCE


22.1 In this chapter we shall consider a technique of analysis which is
of wide application whenever samples of variate data can be classified
in groups. For instance, we may have a sample which consists of p
sub-samples, our interest lying in the question whether the total sample
may be regarded as homogeneous or alternatively whether there is some
indication that the sub-samples were drawn from different populations.
Again, we may have a number of plots of a cereal grown under different
manurial treatments. Our interest here is whether the manures exert
any differential effect on yields; and if we classify the yields into groups
according to the type of fertiliser applied we have the case, already
mentioned, of p sets of data which we require to test for homogeneity, p
being the number of different treatments. To take a more complex
case, we may have a number of observations taken by p different observers
each on a sample affected by q different effects, as for instance, if p
laboratory assistants carry out an assay on samples of a drug from q different
suppliers. Our classification here is two-fold and we wish to discuss
whether there are any significant differences between the q sources of
supply and, independently if possible, whether there are any differences
between the results obtained by the p assistants.

In general we desire to answer the question whether some one variable,
treated as dependent variable, does or does not exhibit heterogeneity
when classified into “arrays”, “families” or “classes” by one or more
independent variables.

A single independent variable

22.2 We shall discuss in the first instance the simplest case of a single 
classification (i.e. according to one independent variable) and shall 
proceed to the more complex cases later. 

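Suppose then that we have a set of variate-values divided into p families,
the number in the jth family being $n_j$. We may array the values thus—

First family     $x_{11}, x_{21}, \ldots, x_{n_1 1}$
Second family    $x_{12}, x_{22}, \ldots, x_{n_2 2}$
. . .
pth family       $x_{1p}, x_{2p}, \ldots, x_{n_p p}$

Let us denote by $\bar{x}_{..}$ the mean of the whole set and by $\bar{x}_{.j}$ the mean of
the jth family. This is a new notation which will be very convenient
for later generalisation, a period replacing any subscript which is averaged.
Then, denoting by $\sum_{ij}$ summation over values of i from 1 to $n_j$ and values
of j from 1 to p, we have the simple algebraic identity

\[
\begin{aligned}
\sum_{ij}(x_{ij}-\bar{x}_{..})^2 &= \sum_{ij}(x_{ij}-\bar{x}_{.j}+\bar{x}_{.j}-\bar{x}_{..})^2 \\
&= \sum_{ij}(x_{ij}-\bar{x}_{.j})^2 + 2\sum_{ij}(x_{ij}-\bar{x}_{.j})(\bar{x}_{.j}-\bar{x}_{..}) + \sum_{ij}(\bar{x}_{.j}-\bar{x}_{..})^2
\end{aligned}
\tag{22.2}
\]

Now if we carry out summations over i alone we have

\[ \sum_i (x_{ij}-\bar{x}_{.j})(\bar{x}_{.j}-\bar{x}_{..}) = (\bar{x}_{.j}-\bar{x}_{..})\sum_i (x_{ij}-\bar{x}_{.j}) = 0 \]

since, by definition, $\bar{x}_{.j}$ is the mean of $x_{ij}$ in the jth family. Hence we
have, from (22.2)

\[
\begin{aligned}
\sum_{ij}(x_{ij}-\bar{x}_{..})^2 &= \sum_{ij}(x_{ij}-\bar{x}_{.j})^2 + \sum_{ij}(\bar{x}_{.j}-\bar{x}_{..})^2 \\
&= \sum_{ij}(x_{ij}-\bar{x}_{.j})^2 + \sum_j n_j(\bar{x}_{.j}-\bar{x}_{..})^2
\end{aligned}
\tag{22.3}
\]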

22.3 This is a fundamental identity and we pause to examine its meaning.
The expression on the left in (22.3) is the sum of squares of all values taken
about their mean, a quantity which we shall call the deviance. If the
total number of observations ($=\sum_j n_j$) is N, the deviance is N times the
variance of the total number of observations, and no confusion will arise
if we call it the total deviance.

The first term on the right in (22.3) is the sum of the deviances of each 
family. Regarding the sum of squares of deviations from a mean as a 
measure of variability, we may regard this term as expressing the variation 
within families. On the other hand the last term on the right in (22.3) 
is the sum of squares of means of families about the total mean and may be 
regarded as expressing the variation between families. Thus we have 
analysed the variation of the whole group into two parts, one expressing 
variation within families, the other expressing variation from family to 
family. 
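The identity is readily verified on a small set of figures. In the sketch below the three families are invented purely for illustration; the total deviance is computed directly and compared with the sum of the within-family and between-family parts, and the vanishing of the product term is checked as well.

```python
# Invented data: three families of unequal size.
families = [[12.0, 15.0, 11.0, 14.0],
            [18.0, 17.0, 20.0],
            [9.0, 13.0, 10.0, 12.0, 11.0]]

def mean(xs):
    return sum(xs) / len(xs)

all_values = [x for fam in families for x in fam]
grand_mean = mean(all_values)

# Left-hand side of the identity: deviance of all values about the total mean.
total = sum((x - grand_mean) ** 2 for x in all_values)
# Right-hand side: deviances within families, and of family means about the total mean.
within = sum((x - mean(fam)) ** 2 for fam in families for x in fam)
between = sum(len(fam) * (mean(fam) - grand_mean) ** 2 for fam in families)
# Product term, which should vanish family by family.
cross = sum((x - mean(fam)) * (mean(fam) - grand_mean) for fam in families for x in fam)

print(round(total, 4), round(within, 4), round(between, 4), round(cross, 10))
```

Here most of the total deviance arises between families, as one would expect from figures chosen so that the family means differ widely.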

22.4 Strictly speaking, perhaps, we ought to call this process an analysis
of deviance, but it has become known as the analysis of variance. In
the particular case when all families contain the same number n, (22.3)
simplifies in a way which exhibits how this term came into use. For then

\[ \sum_{ij}(x_{ij}-\bar{x}_{..})^2 = N \operatorname{var} x = np \operatorname{var} x \]

and hence, on substitution in (22.3)

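\[ np \operatorname{var} x = n\sum_j \operatorname{var}_j x + np \operatorname{var} \bar{x}_{.j} \tag{22.4} \]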

Now if we write $s^2$ for the variance of the whole, $s_m^2$ for the variance of
the p family means and $s_j^2$ for the variance within the jth family, we shall
have

\[ s^2 = \frac{1}{p}\sum_j (s_j^2) + s_m^2 \tag{22.5} \]

Our total variance is then expressed as the sum of two components, a 
mean of the variances within families and the variance of the means of 
families. 

22.5 Equation (22.5) should be compared with equation (18.13) to which
it is formally equivalent. Our discussion of the sampling variation in
non-simple sampling was, in fact, a form of variance-analysis. The
effect of sampling from the parts of a “patchy” population is to increase
the variance by an amount equal to the variation of the means of patches
among themselves.

22.6 Now let us suppose that the p families from which our samples were 
drawn are not different, i.e., that the data are homogeneous. Then the 
variance of the whole sample will give us an estimate of the (common) 
palhnt variance v. If iV is large it makes no practical difference whether 
we use the actual variance of the sample or the alternative estimate of 
(21.3) obtained by dividing the deviance by JV— 1 ; but there are practical 
as well as theoretical reasons for using (21.3) when the sample is small, 
and we shall use it in all cases ; that is to say, we shall base our estimates 
of the variance on the appropriate number of degrees of freedom (21.8). 
An estimate of the parent variance v is then given by 

$$\frac{1}{N-1}\sum_{i,j}(x_{ij}-\bar{x}_{..})^2 \qquad (22.6)$$

But this is not the only estimate we may derive from the data. On 
our hypothesis as to homogeneity, the deviances within families provide 
an estimate when divided by the appropriate number of degrees of freedom. 
Thus a second estimate is given by 

$$\frac{1}{N-p}\sum_{i,j}(x_{ij}-\bar{x}_{.j})^2 \qquad (22.7)$$

Finally, the means $\bar{x}_{.j}$ are distributed with variance $v/n_j$ in virtue of
(18.8) and it may be shown (we must omit the proof) that a third
estimate of v is given by

$$\frac{1}{p-1}\sum_{j} n_j(\bar{x}_{.j}-\bar{x}_{..})^2 \qquad (22.8)$$

22.7 Examination of (22.6), (22.7) and (22.8) will show that the various
numerators are the items entering into (22.3), while the degrees of freedom
forming the denominators are also additive, i.e.

$$N-1 = (N-p)+(p-1)$$

We may therefore exhibit our estimates of v in the form of a table as 
follows : 

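TABLE 22.1. Form of variance-analysis for a single independent variable

(1)                   (2)           (3)                                           (4)
Deviances relating    Degrees of    Deviances                                     Estimates of v
to variation          freedom                                                     (Column (3) divided
                                                                                  by column (2))

Between families      p-1           $\sum_j n_j(\bar{x}_{.j}-\bar{x}_{..})^2$     as in (22.8)
Within families       N-p           $\sum_{i,j}(x_{ij}-\bar{x}_{.j})^2$           as in (22.7)
Total                 N-1           $\sum_{i,j}(x_{ij}-\bar{x}_{..})^2$           as in (22.6)
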

This convenient lay-out enables a check to be made in arithmetical 
examples from the fact that in columns (2) and (3) the value at the foot 
is the sum of values in the body of the table. This is not, however, true 
of column (4). 
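The lay-out of Table 22.1 and the check just described can be sketched in Python as follows. The function name and the data are hypothetical; the families are deliberately of unequal size so that the weighting by $n_j$ matters.

```python
def one_way_table(families):
    """Rows (label, degrees of freedom, deviance, quotient) in the form of Table 22.1."""
    N = sum(len(f) for f in families)
    p = len(families)
    grand = sum(sum(f) for f in families) / N
    means = [sum(f) / len(f) for f in families]
    # Each family mean is weighted by the number of members in the family.
    between = sum(len(f) * (m - grand) ** 2 for f, m in zip(families, means))
    within = sum((v - m) ** 2 for f, m in zip(families, means) for v in f)
    total = sum((v - grand) ** 2 for f in families for v in f)
    return [
        ("Between families", p - 1, between, between / (p - 1)),
        ("Within families", N - p, within, within / (N - p)),
        ("Total", N - 1, total, total / (N - 1)),
    ]

# Hypothetical families of unequal size.
table = one_way_table([[3.0, 5.0, 4.0], [8.0, 9.0, 7.0, 10.0], [2.0, 1.0]])
for row in table:
    print(row)
```

Columns (2) and (3) of the printed rows add up to the figure in the last row; column (4) does not.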

22.8 Now suppose that we have carried out such an analysis for a
particular arithmetical case and derive three estimates $v_1$, $v_2$ and $v_3$ of the
parent variance v. If these three values are in reasonably close agreement
we see no reason to reject the hypothesis that the families all come from
the same population, that the data are homogeneous, or that there are
no real differences between family means. On the other hand, if the
estimates are different (and significantly so in a sense we shall discuss
below) we may reject the hypothesis of homogeneity and conclude that
there exist real differences between some or all of the families.

22.9 To make the argument satisfactory we require some criterion to 
decide when the various estimates are significantly different. This brings 
us to the second fundamental feature of variance-analysis. If the popula- 
tion is normal the two estimates of variance derived from variation within 
and between families are independent and their ratio is distributed,
independently of the actual value of the parent variance, in the form
of (21.16) and hence may be tested in Fisher's z-distribution (21.26), or
the equivalent F- or variance-ratio distribution.

Note that neither estimate of variance can be independent of the
estimate derived from the total variance, for the latter incorporates them
both. Our significance test must relate to the ratio of variation between
classes to variation within classes.

22.10 We shall not present a proof of the results stated in the previous
section but the following line of reasoning will indicate how such a proof
may be derived. For normal populations as we have stated, the mean is
distributed independently of the variance (19.17). On the hypothesis
of homogeneity, the means of families are therefore independent of the 
variances within families ; and consequently the estimate between families, 
which is derived solely from the means, is independent of the estimate 
within families, which is obtained by pooling the deviances within families. 
Hence the fact of independence. That the estimates are distributed like 
variances follows from an elaboration of the consideration that the mean 
of normal samples is also normally distributed, so that the variance 
between families is like a variance of a normal sample ; whereas the 
variation within families is the sum of deviances and, like $\chi^2$, is additive
in the sense that its total is distributed like a constant multiple of a 
variance. 

We proceed to consider two examples, one for large and one for small 
samples. 

Example 22.1. The following table (from the Registrar-General’s
Statistical Review of England and Wales for 1933, Part II) shows the 
numbers of males married in England in that year classified according to 
age and district. (Certain small numbers of unspecified age and those 
under 21 have been omitted). Note the changes of interval at 25- and 
35- years. 

TABLE 22.2

                                   Age (years)
District        21-       25-       30-       35-      45-     55 and     Totals
                                                               upwards

South-East     31,714    43,979    14,995     7,985    3,928     3,717    106,318
North          31,507    39,849    13,620     7,108    3,362     2,916     98,362
Midland        17,465    21,486     6,729     3,340    1,624     1,509     52,153
East            4,016     5,297     1,820       962      457       386     12,938
South-West      4,323     6,065     2,218     1,177      514       580     14,877

Totals         89,025   116,676    39,382    20,572    9,885     9,108    284,648



The question we shall discuss is whether the average age at marriage
differs significantly between the different districts, i.e. we take "district"
as the independent variable. This, apart from its sociological interest, 
might be an important point for decision if we were about to carry out 
a sampling inquiry into some quality which was related to age at marriage, 
such as numbers of children per family.

Taking the centres of the intervals to be 23, 27.5, 32.5, 40, 50 and 57.5
years (the last being an approximation) we find:

TABLE 22.3

District         Mean age    Degrees of    Sum of         Quotient, sum of
                 (years)     freedom       squares        squares divided by
                                                          degrees of freedom

South-East        29.68       106,317       7,092,490          66.71
North             29.31        98,361       6,092,375          61.94
Midland           29.01        52,152       3,105,520          59.55
East              29.43        12,937         807,911          62.44
South-West        29.87        14,876       1,025,284          68.92

Value for the
whole area        29.43       284,643      18,143,921          63.74



This is not a table in the form of Table 22.1. It merely exhibits the 
means and estimated variances for the different districts and the area 
as a whole. We note that the differences between districts are not very 
large but that the mean age at marriage is higher in the south than the 
north. Is this significant in the sense that it could not be a sampling 
effect such as would be obtained if the population were homogeneous ? 

The sum of squares within classes is obtained as the sum of deviances
in the fourth column of the above table and is 18,123,580. This is not
the sum shown at the foot, which is the deviance for the whole area
and is derived from the figures at the foot of Table 22.2. The
difference between the two, 20,341, is the sum of squares between classes
$\sum_j n_j(\bar{x}_{.j}-\bar{x}_{..})^2$, as can be checked by direct calculation from the means.

We then find — 


TABLE 22.4

Variation             Degrees of     Sum of squares     Quotient
                      freedom

Between districts           4             20,341         5,085.25
Within districts      284,643         18,123,580            63.67

Totals                284,647         18,143,921



A test of significance is hardly necessary to show that the quotients 
are in fact significantly different. But if we wish to apply the z-test 
we proceed as follows.

We have

$$z = \tfrac{1}{2}\log_e \frac{5085.25}{63.67} = 2.19$$

$$\nu_1 = 4, \quad \nu_2 = 284{,}643.$$

From Appendix Table 6C we have, for the 0.1 per cent points for $\nu_1 = 4$,

(for $\nu_2 = 60$)        0.8345
(for $\nu_2 = \infty$)    0.7648

The observed value is far greater than these and hence is highly
significant. Alternatively $F = 5085.25/63.67 = 79.9$, which again is beyond
the 0.1 per cent point (Appendix Table 5C). We conclude that the
differences in the mean ages between districts, though comparatively
small, are not accidental.
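The arithmetic of Example 22.1 can be retraced with a short Python sketch working directly from the grouped frequencies. The variable names are ours; the Midland entry for ages 25- is taken as 21,486, the value that agrees with the printed row and column totals of Table 22.2.

```python
import math

centres = [23.0, 27.5, 32.5, 40.0, 50.0, 57.5]   # interval centres used in the text
freq = {
    "South-East": [31714, 43979, 14995, 7985, 3928, 3717],
    "North":      [31507, 39849, 13620, 7108, 3362, 2916],
    "Midland":    [17465, 21486, 6729, 3340, 1624, 1509],   # 25- entry: see lead-in
    "East":       [4016, 5297, 1820, 962, 457, 386],
    "South-West": [4323, 6065, 2218, 1177, 514, 580],
}

def grouped_stats(f):
    """Number, mean and deviance of one district from its grouped frequencies."""
    n = sum(f)
    mean = sum(fi * c for fi, c in zip(f, centres)) / n
    dev = sum(fi * (c - mean) ** 2 for fi, c in zip(f, centres))
    return n, mean, dev

stats = {d: grouped_stats(f) for d, f in freq.items()}
N = sum(n for n, _, _ in stats.values())
grand = sum(n * m for n, m, _ in stats.values()) / N
within = sum(dev for _, _, dev in stats.values())
between = sum(n * (m - grand) ** 2 for n, m, _ in stats.values())

q_between = between / (len(stats) - 1)
q_within = within / (N - len(stats))
F = q_between / q_within
z = 0.5 * math.log(F)
print(round(between), round(within), round(F, 1), round(z, 2))
```

The sums of squares should reproduce Table 22.4 to within rounding, and z should come out close to 2.19.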

Example 22.2. — Table 22.5 shows the yields of 30 plots of barley, 
there being six plots of each of five varieties. In this table the independent 
variable is the variety, so that rows and columns are interchanged as 
compared with Table 22.2. Moreover the number of plots for each
variety is so small that we do not draw up a frequency distribution giving 
the number of plots with yields between certain limits (on the principle 
of Table 22.2) but simply the actual yields of the six plots. We are 
interested in the question whether there is any significant difference in 
the mean yields of the different varieties. 


TABLE 22.5. Yield of grain in grammes on plots of barley of one square yard, there
being five varieties and six plots of each
The tabular arrangement does not represent the physical lay-out of the plots
(Data quoted by Engledow and Yule, "The Principles and Practice of Yield Trials," 1926)

Plot                        Variety
number       1        2        3        4        5       Mean

1           387      372      350      340      398      369.4
2           420      455      417      360      358      402.0
3           353      375      400      358      334      364.0
4           331      328      325      370      340      338.8
5           358      383      378      395      320      366.8
6           400      308      275      375      430      357.6

Mean       374.8    370.2    357.5    366.3    363.3     366.4



The mean of the whole is 366.4. The deviance is easily found to be 43,934.
As in the calculation of a variance, we take some convenient working
mean to simplify the calculation. Similarly we find for the contribution
between families, from the means of columns

$$\sum_{i,j}(\bar{x}_{.j}-\bar{x}_{..})^2 = 6\sum_{j}(\bar{x}_{.j}-\bar{x}_{..})^2$$

$$= 6\{(374.8-366.4)^2 + \dots + (363.3-366.4)^2\}$$

$$= 1043$$

For the sum of squares within classes we merely subtract this quantity
from the total deviance. Our analysis of variance then becomes:


TABLE 22.6

Variation              Degrees of    Sum of      Quotient
                       freedom       squares

Between varieties          4          1,043        260.75
Within varieties          25         42,891      1,715.64

Total                     29         43,934




We have here an interesting case in which the variance between 
varieties is less than that within varieties. If this effect is real there 
must be some negative intraclass correlation present, a point to which 
we return below. To test the significance we have

$$z = \tfrac{1}{2}\log_e \frac{1715.64}{260.75} = 0.942, \quad \nu_1 = 25, \quad \nu_2 = 4$$

From Appendix Tables 6A and 6B we see that, for these degrees of freedom 
the 5 per cent point is 0*876 and the 1 per cent point 1*31. The observed 
value lies between them and is just beyond the 5 per cent point. The 
result thus is barely significant, i.e. the evidence is weak that there is 
any real difference between the yields of the different varieties. 
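The analysis of Example 22.2 can be reproduced from the yields of Table 22.5 with the following Python sketch (names are ours). It works from unrounded means, so the sums of squares differ by about a unit from those of Table 22.6, which rest on means rounded to one decimal.

```python
import math

# Yields from Table 22.5, one list per variety (six plots each).
varieties = [
    [387, 420, 353, 331, 358, 400],
    [372, 455, 375, 328, 383, 308],
    [350, 417, 400, 325, 378, 275],
    [340, 360, 358, 370, 395, 375],
    [398, 358, 334, 340, 320, 430],
]
N = sum(len(v) for v in varieties)
p = len(varieties)
grand = sum(map(sum, varieties)) / N

total = sum((x - grand) ** 2 for v in varieties for x in v)
between = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in varieties)
within = total - between              # by subtraction, as in the text

q_between = between / (p - 1)
q_within = within / (N - p)
# The larger quotient goes in the numerator, so nu1 = 25 and nu2 = 4 here.
z = 0.5 * math.log(q_within / q_between)
print(round(total, 2), round(between, 2), round(within, 2), round(z, 3))
```

The value of z obtained in this way lies between the 5 per cent point 0.876 and the 1 per cent point 1.31 quoted above.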

Some practical points

22.11 We proceed to consider a few practical points in the analysis and 
interpretation of variance analysis in the case of a single independent
variable. 

First of all, as regards the arithmetic. There is no difficulty about
determining the number of degrees of freedom, and the only arithmetical
labour arises from the determination of the sums of squares. The total
deviance is determined exactly as in the calculation of variance. We
first find the mean, then, with a convenient working mean, determine the
sum of squares about that mean, and finally transfer to the real mean by
some such formula as

$$\sum_{i,j}(x_{ij}-\bar{x}_{..})^2 = \sum_{i,j}(x_{ij}^2) - N\bar{x}_{..}^2 \qquad (22.9)$$

which is only (6.4) in a different guise.

The next process is to determine the deviance between families. For
this we require the family means $\bar{x}_{.j}$. Again with a working mean if
desired (though, as in Example 22.2, it is not always necessary when there
are only a few families) we calculate the contribution $\sum_j n_j(\bar{x}_{.j}-\bar{x}_{..})^2$.

A point to watch here is that each contribution to the sum is weighted
by the factor $n_j$. In the case where all the $n$'s are equal we have

$$\sum_{j} n(\bar{x}_{.j}-\bar{x}_{..})^2 = n\sum_{j}(\bar{x}_{.j}-\bar{x}_{..})^2$$

$$= n\sum_{j}(\bar{x}_{.j}^2) - N\bar{x}_{..}^2 \qquad (22.10)$$

The direct determination of the sum of squares within families is a 
tedious business when the numbers in the families are large and ungrouped. 
The required quantity can, however, be ascertained by subtraction as 
in Example 22.2. This sacrifices a check on the arithmetic but is the 
procedure usually followed. 

In the light of these comments the reader should verify the arithmetic 
of Example 22.2. 
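The working-mean device can be illustrated by a small Python sketch. The function name and the choice of 370 as working mean are ours; the six values are the yields of variety 1 in Table 22.5. The sum of squares about the working mean is corrected by the square of the total deviation divided by the number of values, which is the same correction as the term $N\bar{x}^2$ when x is measured from the working mean.

```python
def deviance_via_working_mean(values, working_mean):
    """Sum of squares about the true mean, from deviations about a working mean."""
    d = [v - working_mean for v in values]
    n = len(d)
    # sum(d^2) - n * (mean of d)^2, with n * mean^2 written as (sum d)^2 / n
    return sum(x * x for x in d) - sum(d) ** 2 / n

values = [387, 420, 353, 331, 358, 400]      # variety 1 of Table 22.5
mean = sum(values) / len(values)
direct = sum((v - mean) ** 2 for v in values)
shortcut = deviance_via_working_mean(values, 370)
print(direct, shortcut)
```

Any working mean gives the same result; a round figure near the true mean merely keeps the squares small.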

We might add that the formal analysis of variance does not relieve the 
student from the necessity of looking at the data in a general way to 
make a preliminary comparison. In Example 22.1 we tabulated the means 
and remarked that they were not very different, even if significantly so. 
Our work may be regarded as the simultaneous testing of the significance 
of the differences between a set of means. Any pair of means can be 
compared by the t-test; we have tested all the differences together.

22.12 Consider now the application of the z-test. Strictly speaking,
this is valid only when the parent population is normal. There is some 
evidence that in the contrary case the test remains valid provided that 
the departure from normality is not great, as for instance, in a great deal 
of biological material. But when the departure is considerable, special 
measures may be necessary to deal with the significance test. 

22.13 The reader will observe that the values tabulated in Appendix 
Tables 6 are all positive, which implies (since z is a logarithm) that in
working out a variance ratio we always take the larger value for the
numerator. In Example 22.1 we examined the ratio given by (variance
between families)/(variance within families) whereas in Example 22.2 we
took the reciprocal of this ratio. The general rule is always to take the
larger figure as the numerator but this raises a point in connection with
the significance test on which it is well to be clear. Our significance
values attached to a probability level of P per cent are chosen so that
there is probability P/100 that the values will be attained or exceeded.
The probability that a ratio will attain or exceed a given value k, or that
if it is less than unity its reciprocal will fall below 1/k, is 2P/100, twice the
value for either contingency alone. When we are interested in either
contingency the probability levels given in the Tables should be doubled.

22.14 Appendix Table 6 will probably be sufficient for most purposes
but it is worth recording that for large $\nu_1$ and $\nu_2$, z is distributed
approximately normally with mean $\frac{1}{2}\left(\frac{1}{\nu_2}-\frac{1}{\nu_1}\right)$ and variance $\frac{1}{2}\left(\frac{1}{\nu_1}+\frac{1}{\nu_2}\right)$.

In Example 22.1, for instance, $\nu_2$ is so large that we may neglect its reciprocal
and, since $\nu_1 = 4$, the approximate result leads to the conclusion that z is
distributed normally with mean -0.125 and standard deviation 0.3535.
The actual value of 2.19 deviates from the mean by more than six times
the standard deviation and is therefore highly significant. In our present
example the test is rough because $\nu_1$ is not large, but for $\nu_1$ and $\nu_2$ greater
than 30 the approximation is quite good; and even for lower values it is
useful to carry in one's head as a rough guide.
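A minimal Python sketch of this rough guide follows. It assumes the approximation in the form: mean of z equal to half of (1/ν2 - 1/ν1), variance equal to half of (1/ν1 + 1/ν2); this form reproduces the figures -0.125 and 0.3535 quoted for Example 22.1. The function name is ours.

```python
import math

def z_normal_approx(nu1, nu2):
    """Approximate mean and standard deviation of z for large degrees of freedom."""
    inv1 = 1.0 / nu1
    inv2 = 0.0 if math.isinf(nu2) else 1.0 / nu2
    mean = 0.5 * (inv2 - inv1)
    sd = math.sqrt(0.5 * (inv1 + inv2))
    return mean, sd

# Example 22.1: nu1 = 4, nu2 = 284,643, observed z = 2.19.
mean, sd = z_normal_approx(4, 284643)
standardised = (2.19 - mean) / sd
print(round(mean, 3), round(sd, 4), round(standardised, 1))
```

The standardised deviation exceeds six, in agreement with the remark above.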


Relationship with intra-class correlation

22.15 In 11.38 we considered the intra-class correlation of a number of
families. In the notation of the present chapter equation (11.33) can
be written

$$s_m^2 = \frac{s^2}{n}\{1+(n-1)r\} \qquad (22.11)$$

or

$$r = \frac{n}{n-1}\,\frac{s_m^2}{s^2} - \frac{1}{n-1} \qquad (22.12)$$

Now $s^2$ is the variance of the total and is equal to $S/np$ where $S$ is the total
deviance. Also $s_m^2$ is $S_1/np$ where $S_1$ is the sum of squares between
families. Writing $S_2$ for the sum of squares within families ($=S-S_1$) we
find from (22.12)

$$r = \frac{S_1}{S} - \frac{1}{n-1}\,\frac{S_2}{S} \qquad (22.13)$$

This formula exhibits the relation between intra-class r and the
constituent items of the analysis into sums of squares.

22.16 If now we denote by $Q_1$ and $Q_2$ the quotients obtained from $S_1$
and $S_2$ we have

$$Q_1 = \frac{S_1}{p-1}, \qquad Q_2 = \frac{S_2}{p(n-1)}$$

From (22.13) we see that r is negative if and only if

$$S_2 > (n-1)S_1$$

which is equivalent to

$$pQ_2 > (p-1)Q_1 \qquad (22.14)$$

This condition was verified in Example 22.2. It is of rather rare 
occurrence in practical cases. 
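The condition can be checked on the figures of Table 22.6 with the Python sketch below. It assumes the relation between r and the sums of squares in the form r = S1/S - S2/((n-1)S), where S1 is the sum of squares between families, S2 that within families, S their total and n the number in each family; the function name is ours.

```python
def intraclass_r(S1, S2, n):
    """Intra-class r from sums of squares between (S1) and within (S2) families of n."""
    S = S1 + S2
    return S1 / S - S2 / ((n - 1) * S)

# Table 22.6: p = 5 varieties of n = 6 plots each.
S1, S2, n, p = 1043.0, 42891.0, 6, 5
r = intraclass_r(S1, S2, n)

Q1 = S1 / (p - 1)            # quotient between varieties
Q2 = S2 / (p * (n - 1))      # quotient within varieties
negative_by_quotients = p * Q2 > (p - 1) * Q1
print(round(r, 3), negative_by_quotients)
```

The value of r is negative but cannot fall below -1/(n-1), here -0.2.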

Two independent variables 

22.17 We now proceed to the case when the data are classified by two 
qualities A and B, p of one and q of the other, making pq sub-classes in 
all. We shall consider in the first instance the simple case where there 
is only one member in each sub-class. We shall denote the value of the
member in the $i$th class of A and the $j$th class of B by $x_{ij}$. We then have
the algebraic identity

$$\sum_{i,j}(x_{ij}-\bar{x}_{..})^2 = \sum_{i,j}(x_{ij}-\bar{x}_{i.}-\bar{x}_{.j}+\bar{x}_{..})^2 + \sum_{i,j}(\bar{x}_{i.}-\bar{x}_{..})^2 + \sum_{i,j}(\bar{x}_{.j}-\bar{x}_{..})^2 \qquad (22.15)$$

The product terms in the expansion vanish as in the case of the single
independent variable discussed in 22.2. This equation presents an
analysis of the total sum of squares into three constituent sums. We
state without proof that if all the data are drawn from a population with
variance v the three items on the right are estimates of $(p-1)(q-1)v$,
$(p-1)v$ and $(q-1)v$ respectively and they are independent each of the
other two. We may then present an analysis of variance in the following
form:

TABLE 22.7. Form of analysis of variance for two independent variables with one
member in each sub-class

Variation             Degrees of      Sum of squares                                                          Quotient
                      freedom

Between A-classes     p-1             $\sum_{i,j}(\bar{x}_{i.}-\bar{x}_{..})^2 = S_A$                         $S_A/(p-1)$
Between B-classes     q-1             $\sum_{i,j}(\bar{x}_{.j}-\bar{x}_{..})^2 = S_B$                         $S_B/(q-1)$
Residual              (p-1)(q-1)      $\sum_{i,j}(x_{ij}-\bar{x}_{i.}-\bar{x}_{.j}+\bar{x}_{..})^2 = S_R$     $S_R/\{(p-1)(q-1)\}$

Totals                pq-1            $\sum_{i,j}(x_{ij}-\bar{x}_{..})^2$

The first two items are obvious extensions of the variation between
families which we encountered in the one-way case of the single independent
variable. The item we have called "residual" in this table has no very
obvious interpretation but we may regard it as assignable to variation
within the sub-class. Each contributory deviation may be looked on
as the remainder when the effect of the classes A and B (if any) is removed.
For instance $x_{ij}-\bar{x}_{i.}$ is the deviation of the value from the mean of values
in the $i$th A-class. The mean of $x_{ij}-\bar{x}_{i.}$ in the $j$th B-class is $\bar{x}_{.j}-\bar{x}_{..}$ and
thus $x_{ij}-\bar{x}_{i.}-(\bar{x}_{.j}-\bar{x}_{..})$ is the deviation from the average value obtained
by taking means for the A- and B-classes separately.


22.18 If the quotient for A is significantly different from the residual 
quotient we may conclude that there is heterogeneity so far as concerns 
A ; and similarly for B. We now meet a new point which did not arise 
in the case of a single independent variable. Suppose that the significance 
tests show that the data are heterogeneous in A. Can we then proceed 
to test for heterogeneity in B ? 

The answer in general is no, but there is one class of case in which it 
is affirmative. 

Suppose that the value $x_{ij}$ is made up of three independent and additive
parts:

(1) the effect of belonging to the class $A_i$, say $a_i$;

(2) the effect of belonging to the class $B_j$, say $b_j$;

(3) a residual, say $\epsilon_{ij}$, which is normally distributed with zero mean and
variance v.

Then we have

$$x_{ij} = a_i + b_j + \epsilon_{ij} \qquad (22.16)$$

The reader should consider this hypothesis carefully. It is equivalent
to an assumption that the observations are affected by a systematic effect,
$a_i$, which varies from one A-class to another but affects all B-classes alike
in the sub-class $A_i$; a similar effect for B; and the residual normal effect.


22.19 If $m_{ij}$ is the population mean of $x_{ij}$, $\bar{a}$ that of $a_i$ and so on we have
from (22.16)

$$m_{ij} = a_i + b_j$$
$$m_{i.} = a_i + \bar{b}$$
$$m_{.j} = \bar{a} + b_j$$
$$m_{..} = \bar{a} + \bar{b} \qquad (22.17)$$

Then

(22.18)

the product term vanishing as usual. Now from (22.17) it is clear that
$m_{ij}-m_{i.}-m_{.j}+m_{..}$ vanishes and hence the right-hand side of (22.18)
reduces to its last term. Thus the residual quotient is an estimator of the
variance which has just the same value as if $a_i$ and $b_j$ were non-existent.
That is to say, on the hypothesis represented by (22.16) the residual
quotient continues to offer an estimate of v, the variance of the residual.

It follows that, on this type of hypothesis, even if the A-effects are
significant we can still test for the B-effects with the aid of the residual
quotient. We may also note that, in any case, if the m's are small, the
residual variance is not greatly affected so that an approximate test can
be carried out.
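That additive class effects leave the residual sum of squares untouched can be seen numerically. In the Python sketch below (all figures invented, names ours) a table of residuals is given arbitrary additive A- and B-effects, and the residual sum of squares is computed before and after.

```python
def residual_ss(table):
    """Residual sum of squares of a two-way table with one value per sub-class."""
    p, q = len(table), len(table[0])
    row = [sum(r) / q for r in table]                                   # A-class means
    col = [sum(table[i][j] for i in range(p)) / p for j in range(q)]    # B-class means
    grand = sum(row) / p
    return sum((table[i][j] - row[i] - col[j] + grand) ** 2
               for i in range(p) for j in range(q))

base = [[1.0, 4.0, 2.0], [3.0, 3.0, 5.0], [0.0, 2.0, 1.0]]   # hypothetical residuals
a = [10.0, -4.0, 7.0]                                        # hypothetical A-effects
b = [0.5, 2.0, -3.0]                                         # hypothetical B-effects
shifted = [[base[i][j] + a[i] + b[j] for j in range(3)] for i in range(3)]

print(residual_ss(base), residual_ss(shifted))
```

The two figures agree, whereas the sums of squares between A-classes and between B-classes would of course be altered by the added effects.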

Example 22.3. The following is an example in which the dependent
variable is or may be subject to the influence of two independent variables.
Four varieties of potato are planted each on five plots of ground of the
same size and type; and each variety is treated with five different fertilisers.
The yields in tons are as follows:

TABLE 22.8

                      Fertiliser
Variety      1       2       3       4       5

1           1.9     2.2     2.6     1.8     2.1
2           2.5     1.9     2.3     2.6     2.2
3           1.7     1.9     2.2     2.0     2.1
4           2.1     1.8     2.5     2.3     2.4


We require to consider whether there is evidence that (a) any difference 
exists between the yields of varieties independently of the fertiliser and 
(b) any differential effect is exerted by the fertiliser independently of the
variety. 

Before carrying out an analysis let us look at the data generally. Since
each variety is treated once and only once with each fertiliser, we may 
expect that comparisons of totals for the four varieties are permissible ; 
the total yield of one variety is comparable with that of another because 
they are both treated by the different fertilisers to the same extent. 
Similarly, a comparison of fertiliser effects is legitimate because each 
variety is equally represented in the five fertiliser totals. The data may 
be said to be balanced. 

It will simplify the arithmetic if we measure our yields about mean 
2.0 and express them in tenths of a ton. Table 22.8 then becomes, on
the insertion of totals:

TABLE 22.9

                      Fertiliser
Variety      1       2       3       4       5      Total

1           -1       2       6      -2       1        6
2            5      -1       3       6       2       15
3           -3      -1       2       0       1       -1
4            1      -2       5       3       4       11

Totals       2      -2      16       7       8       31

The sum of squares of yields (the 20 values in the main body of the table)
will be found to be 191. We then have

$$\bar{x}_{..} = 31/20 = 1.55$$

$$N\bar{x}_{..}^2 = 48.05$$

$$\sum_{i,j}(x_{ij}-\bar{x}_{..})^2 = \sum_{i,j}(x_{ij}^2) - N\bar{x}_{..}^2 = 191 - 48.05$$

$$= 142.95$$

with (5×4)-1 = 19 degrees of freedom.

We may now obtain the sum of squares between varieties direct from 
the row totals of the table. These totals are, in fact, five times the means.
The sum of squares of means is thus 1/25 of the sum of squares of row
totals; but (and here is a slight trap) each square of a mean is to be
counted five times in ascertaining the sum of squares between varieties.
Thus the latter quantity is given by the sum of squares of row totals,
divided by five, less $N\bar{x}_{..}^2$. The sum of squares of row totals in Table
22.9 is 383 and thus the sum of squares between varieties is

$$383/5 - 48.05 = 28.55$$

with three degrees of freedom.

Similarly, the sum of squares of column totals is 377 and hence the sum
of squares between fertilisers is

$$377/4 - 48.05 = 46.2$$

with four degrees of freedom. 

The analysis of variance then becomes:

TABLE 22.10

Variation               Degrees of    Sums of     Quotient
                        freedom       squares

Between fertilisers         4           46.2       11.55
Between varieties           3           28.55       9.52
Residual                   12           68.2        5.68

Totals                     19          142.95

To test the effect between fertilisers we have

$$z = \tfrac{1}{2}\log_e \frac{11.55}{5.68} = 0.3545, \quad \nu_1 = 4, \quad \nu_2 = 12$$

This is not significant, being well below the 5 per cent point. Similarly,
for the effect between varieties

$$z = \tfrac{1}{2}\log_e \frac{9.52}{5.68} = 0.2609$$

which again is not significant. A test of the variance-ratio direct leads
to the same conclusions.

We conclude that for these data there is no evidence of heterogeneity,
i.e. that they could have arisen from a population in which there was 
no difference between the yields of varieties and the fertilisers did not 
differ in their effect. 
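The sums of squares of Example 22.3 follow from the row and column totals exactly as described, and the Python sketch below (names ours) retraces the steps on the coded yields of Table 22.9.

```python
# Table 22.9: rows are varieties, columns are fertilisers
# (yields in tenths of a ton, measured about 2.0).
data = [
    [-1, 2, 6, -2, 1],
    [5, -1, 3, 6, 2],
    [-3, -1, 2, 0, 1],
    [1, -2, 5, 3, 4],
]
p, q = len(data), len(data[0])          # 4 varieties, 5 fertilisers
N = p * q
grand_total = sum(map(sum, data))
correction = grand_total ** 2 / N       # N times the square of the mean

total = sum(x * x for row in data for x in row) - correction
# Squares of row totals are divided by the number of entries in a row,
# squares of column totals by the number of entries in a column.
varieties = sum(sum(row) ** 2 for row in data) / q - correction
col_totals = [sum(row[j] for row in data) for j in range(q)]
fertilisers = sum(t * t for t in col_totals) / p - correction
residual = total - varieties - fertilisers

quotients = {
    "fertilisers": fertilisers / (q - 1),
    "varieties": varieties / (p - 1),
    "residual": residual / ((p - 1) * (q - 1)),
}
print(total, varieties, fertilisers, residual)
print(quotients)
```

The residual is obtained by subtraction, so that, as in the one-way case, a check on the arithmetic is sacrificed.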

Significance of the correlation ratio 

22.20 At this point we turn aside from the development of the general 
theory to show how the analysis of variance provides accurate tests of 
significance for the correlation ratio, regression coefficients and the 
multiple correlation coefficient. 

The distribution of $\eta^2$ in samples from an uncorrelated normal population
can be derived from Fisher’s z-distribution. Hence we may test whether
an observed value of $\eta^2$ is significant of the existence of correlation in the
parent, assumed normal or approximately so.

When considering the correlation ratio in 11.6 we saw that for the
array of x's

$$\sigma_x^2 = \sigma_{ax}^2 + \sigma_{mx}^2$$

where

$\sigma_x^2$ is the variance of the whole
$\sigma_{ax}^2$ is the variance within arrays
$\sigma_{mx}^2$ is the variance of array means






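If there are p arrays and $n_j$ is the number of members in the $j$th array,
we may write this in the notation of the present chapter:

$$\frac{1}{N}\sum_{i,j}(x_{ij}-\bar{x}_{..})^2 = \frac{1}{N}\sum_{i,j}(x_{ij}-\bar{x}_{.j})^2 + \frac{1}{N}\sum_{j} n_j(\bar{x}_{.j}-\bar{x}_{..})^2 \qquad (22.19)$$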

Now let us regard the arrays as families or classes, and the items of the 
arrays as class-members. Equation (22.19) is then an analysis of variance 
in the following form : 

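TABLE 22.11

Variation           Degrees of    Sums of squares                               Quotients
                    freedom

Between classes     p-1           $\sum_j n_j(\bar{x}_{.j}-\bar{x}_{..})^2$     $N\sigma_x^2\eta^2/(p-1)$
Within classes      N-p           $\sum_{i,j}(x_{ij}-\bar{x}_{.j})^2$           $N\sigma_x^2(1-\eta^2)/(N-p)$

Total               N-1           $\sum_{i,j}(x_{ij}-\bar{x}_{..})^2$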



In the last column we have anticipated results which are easily proved 
as follows — 

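By definition,

$$\sum_{j} n_j(\bar{x}_{.j}-\bar{x}_{..})^2 = N\sigma_x^2\eta^2$$

Hence,

$$\sum_{i,j}(x_{ij}-\bar{x}_{.j})^2 = N\sigma_x^2(1-\eta^2)$$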

Dividing the sums of squares by the appropriate number of degrees of 
freedom, we get the results of the final column. 

Now, if the population is normal and uncorrelated, the two quotients
are not significantly different; for they are independent estimates of the
variance of x in the population, all arrays having the same mean and
standard deviation.* We may test the significance of their difference by
the z-distribution. We have

$$z = \tfrac{1}{2}\log_e\left\{\frac{N\sigma_x^2\eta^2}{p-1}\Big/\frac{N\sigma_x^2(1-\eta^2)}{N-p}\right\} = \tfrac{1}{2}\log_e\left\{\frac{\eta^2}{1-\eta^2}\cdot\frac{N-p}{p-1}\right\} \qquad (22.20)$$

$$\nu_1 = p-1, \quad \nu_2 = N-p \qquad (22.21)$$

* Strictly speaking, this is only approximately true of arrays of finite width. If the
ranges defining the arrays are very broad, the test must be used with reserve.

In equation (22.20) we have omitted the suffix xy in writing $\eta^2$. Clearly
a similar test may be applied to $\eta_{yx}$, p in this case referring to the number
of y-arrays.

22.21 From the relation (22.20) between z and $\eta^2$ it may be shown that
the distribution of $\eta^2$, corresponding to that of z given by equation
(21.18), is

$$dF = \frac{1}{B\left(\frac{p-1}{2},\frac{N-p}{2}\right)}\,(\eta^2)^{\frac{p-3}{2}}(1-\eta^2)^{\frac{N-p-2}{2}}\,d(\eta^2) \qquad (22.22)$$

It will be seen that this involves the number p, i.e. depends on the
number of arrays into which the data are grouped. This fact is important,
and reveals that the use of the standard error given in 19.27 can
be no more than an approximation at the best; for that formula does not
contain p.

22.22 It is interesting to note that, since $\eta^2$ is positive, its mean value
will not be zero. The mean value (which differs from the square of the
mean value of $\eta$) is given by

$$\text{mean}\,(\eta^2) = \frac{p-1}{N-1} \qquad (22.23)$$


Example 22.4. Let us consider the data of Table 9.3 (correlation
between stature of father and stature of son), in which $\eta_{xy}=\eta_{yx}=0.52$.
We know that the distribution is approximately normal, a fact which is
borne out by the approximate equality of the two correlation ratios, and
hence we may apply the foregoing theory with considerable confidence.

We have, for $\eta_{xy}$,

$$\nu_1 = p-1 = 16$$

$$\nu_2 = N-p = 1078-17 = 1061$$

$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.52)^2}{1-(0.52)^2}\cdot\frac{1061}{16}\right\} = 1.60$$

From Appendix Table 6C we see that the 0.1 per cent significance
points are as follows:

                        $\nu_1 = 12$    $\nu_1 = 24$

$\nu_2 = 60$               0.5992          0.4955
$\nu_2 = \infty$           0.5044          0.3786

The observed z is therefore very strongly significant of correlation in 
the population. 
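The calculation of Example 22.4 may be sketched in Python as follows. The function name is ours, and the test statistic is taken in the form z = ½ log_e{η²/(1-η²) · (N-p)/(p-1)} with p-1 and N-p degrees of freedom.

```python
import math

def z_correlation_ratio(eta, N, p):
    """z for testing a correlation ratio, with its two degrees of freedom."""
    ratio = eta ** 2 / (1 - eta ** 2) * (N - p) / (p - 1)
    return 0.5 * math.log(ratio), p - 1, N - p

# Example 22.4: eta = 0.52, N = 1078 observations in p = 17 arrays.
z, nu1, nu2 = z_correlation_ratio(0.52, 1078, 17)

# Mean value of eta^2 to be expected by chance in the absence of correlation.
mean_eta2 = (17 - 1) / (1078 - 1)
print(round(z, 2), nu1, nu2, round(mean_eta2, 4))
```

The observed $\eta^2$ of about 0.27 may be compared with the chance expectation of about 0.015.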


Test of linearity of regression 

22.23 In 11.7 we saw that the regression of y on x was linear if, and
only if, $\eta^2-r^2=0$. An important question to decide is, therefore, can
an observed value of $\eta^2-r^2$ have arisen from a population in which the
regression is linear, i.e. the true value is zero?

This question can be decided by the z-test in a similar manner to that
of 22.20 and 22.21. We consider the analysis of the sum of squares of
deviations from the regression line into two parts: (1) deviations within
arrays, and (2) deviations of means of arrays from the regression line. In
this way it may be shown that the linearity may be tested by taking

$$z = \tfrac{1}{2}\log_e\left\{\frac{\eta^2-r^2}{1-\eta^2}\cdot\frac{N-p}{p-2}\right\} \qquad (22.24)$$

$$\nu_1 = p-2, \quad \nu_2 = N-p \qquad (22.25)$$

Example 22.5. In considering the correlation between old-age
pauperism (x) and the proportion of out-relief (y), Yule found (Economic
Journal, 1896, 6, 613)

$$N = 235, \quad r = +0.34,$$

$$\eta = 0.46 \text{ for the x-arrays}, \quad \eta = 0.39 \text{ for the y-arrays}$$

for a grouping of 19 x-arrays and 8 y-arrays. Can the regressions be
supposed linear?

For the x-arrays, N-p = 216, p-2 = 17

$$\frac{\eta^2-r^2}{1-\eta^2} = \frac{(0.46)^2-(0.34)^2}{1-(0.46)^2} = 0.12177$$

$$z = \tfrac{1}{2}\log_e\left(\frac{216}{17}\times 0.12177\right) = 0.218$$

The 5 per cent point for $\nu_1 = 17$, $\nu_2 = 216$ is about 0.25, and there is thus
no reason to suppose from the observed z that the regression is not linear.
Alternatively for the variance ratio F we find

$$F = \frac{216 \times 0.12177}{17} = 1.55$$

For the y-arrays, similarly, N-p = 227, p-2 = 6.

$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.39)^2-(0.34)^2}{1-(0.39)^2}\cdot\frac{227}{6}\right\} = 0.244$$

This also will be found to lie within the sampling limits, and the test
therefore does not reject the linearity of either regression.
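Both tests of Example 22.5 can be retraced with the Python sketch below. The function name is ours; the statistic is taken as z = ½ log_e{(η²-r²)/(1-η²) · (N-p)/(p-2)}, and the quantity inside the logarithm is the variance ratio F.

```python
import math

def z_linearity(eta, r, N, p):
    """Return (z, F) for the test of linearity of regression with p arrays."""
    F = (eta ** 2 - r ** 2) / (1 - eta ** 2) * (N - p) / (p - 2)
    return 0.5 * math.log(F), F

# Example 22.5: N = 235, r = 0.34.
z_x, F_x = z_linearity(0.46, 0.34, 235, 19)   # 19 x-arrays
z_y, F_y = z_linearity(0.39, 0.34, 235, 8)    # 8 y-arrays
print(round(z_x, 3), round(F_x, 2), round(z_y, 3), round(F_y, 2))
```

The sketch presupposes that η exceeds r numerically, as it must in any actual grouping; otherwise the logarithm is undefined.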
Significance of the multiple correlation coefficient

22.24 The multiple correlation coefficient is in many ways analogous 
to the correlation ratio, and we may test its significance by a procedure 
very similar to that used for the significance of the correlation ratio and 
regressions. 

Consider the regression equation with p variates,

$$x_1 = b_2x_2 + b_3x_3 + \dots + b_px_p$$

the variates being measured from their means.

We may regard the deviations of observed values of $x_1$ as composed of
two parts: (1) deviations from the values of $x_1$ given by the regression
equation, and (2) deviations of the latter from the mean of $x_1$. The sum
of squares can be analysed accordingly.

The sum of squares of deviations of observed values of $x_1$ from the
mean of $x_1$ is $N\sigma_1^2$, by definition, and has N-1 degrees of freedom.

The sum of squares of deviations of observed $x_1$'s from the regression
values is $N\sigma_{1.23\dots p}^2$, which, by the definition of $R_{1(23\dots p)}$, is equal to
$N\sigma_1^2(1-R_{1(23\dots p)}^2)$. This has N-p degrees of freedom, for $\sigma_1^2$ has
N-1 degrees of freedom, $\sigma_{1.2}^2$ has N-2 degrees, and so on. Writing
R for $R_{1(23\dots p)}$ we may express the analysis in the following tabular
form:

TABLE 22.12

Variation                           Degrees of    Sums of squares          Quotients
                                    freedom

Between classes                     p-1           $N\sigma_1^2R^2$         $N\sigma_1^2R^2/(p-1)$
(Regression values from mean.)

Within classes                      N-p           $N\sigma_1^2(1-R^2)$     $N\sigma_1^2(1-R^2)/(N-p)$
(Deviations from regression
values.)

Total                               N-1           $N\sigma_1^2$


Now if the parent value of R is zero, the quotients should not differ
significantly; for $x_1$ and $b_2x_2 + \dots + b_px_p$ are then uncorrelated, and
hence deviations of $x_1$ from the regression values are uncorrelated with,
and independent of, deviations of the regression values from the mean,
the population being normal.

Hence we may test the significance of R by putting

$$z = \tfrac{1}{2}\log_e\left\{\frac{R^2}{1-R^2}\cdot\frac{N-p}{p-1}\right\} \qquad (22.26)$$

$$\nu_1 = p-1, \quad \nu_2 = N-p \qquad (22.27)$$

It will be seen that equation (22.26) is of the same form as equation
(22.20). The distributions of $R^2$ and $\eta^2$ are formally identical, and we
have, for instance, corresponding to equation (22.23),

$$\text{mean}\,(R^2) = \frac{p-1}{N-1} \qquad (22.28)$$

Example 22.6. In Example 12.3, page 299, we found $R_{1(23)}=0.74$.
Is this significant?

We have

$$p = 3, \quad N = 38$$

$$\nu_1 = 2, \quad \nu_2 = 35$$

$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.74)^2}{1-(0.74)^2}\cdot\frac{35}{2}\right\} = 1.53$$

For $\nu_1 = 2$, the 0.1 per cent significance points are:

$\nu_2 = 30$      1.0859
$\nu_2 = 40$      1.0552


The observed z is well above these values and hence R is significant. 
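A brief Python sketch of Example 22.6 follows (function name ours). It takes the statistic as z = ½ log_e{R²/(1-R²) · (N-p)/(p-1)} and sets the observed R² beside the mean value (p-1)/(N-1) to be expected when the parent R is zero.

```python
import math

def z_multiple_correlation(R, N, p):
    """z for testing a multiple correlation coefficient from p variates."""
    ratio = R ** 2 / (1 - R ** 2) * (N - p) / (p - 1)
    return 0.5 * math.log(ratio), p - 1, N - p

# Example 22.6: R = 0.74, N = 38, p = 3.
z, nu1, nu2 = z_multiple_correlation(0.74, 38, 3)
observed_R2 = 0.74 ** 2
chance_R2 = (3 - 1) / (38 - 1)      # mean of R^2 when the parent R is zero
print(round(z, 2), nu1, nu2, round(observed_R2, 3), round(chance_R2, 3))
```

The observed $R^2$ is roughly ten times its chance expectation, in keeping with the result of the z-test.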


Unequal numbers in classes 

22.25 The treatment given in 22.16 to the case of two independent 
variates was based on the assumption that there was only one member in 
each sub-class. In the contrary case an accurate treatment is much more 
difficult and we shall not be able to deal with it here. The following 
remarks are intended as a preliminary to further reading — 

(a) If the number in each sub-class is the same the foregoing theory
still applies.

(b) The theory also applies if the numbers in sub-classes are
proportionate, that is to say, if the frequency in the sub-class $A_iB_j$ is a
constant multiple of $(A_i)(B_j)$, where $(A_i)$ and $(B_j)$ are the frequencies
in the classes $A_i$ and $B_j$ respectively.

(c) In other cases the theory does not apply; but if the numbers in
sub-classes are not very different from equality or proportionality,
an analysis carried out on the means of sub-classes as if they were
the primary data, one to each sub-class, will probably not be
misleading, although it sacrifices some information.

(d) In any case a p×q classification with more than one member in
the sub-classes can always be regarded as a one-way classification
into pq classes. An analysis on these lines will provide a test of
homogeneity but does not distinguish, as it were, whether
departures from homogeneity are due to A or B or to a mixture
of both.

Non-normal variation 

22.26 Some comments are also desirable, though again the matter is 
too complicated for detailed treatment, on the assumptions of normality 
which underlie the exact treatment of significance tests in the analysis 
of variance. When the parent population is not normal estimates of 
means are not independent of variances, so that the quotients given by 
the analysis are dependent. Further, the logarithm of the variance-ratio
is no longer distributed in the z-form. We have already referred
to the fact that sampling and theoretical inquiries suggest that if deviations
from normality are only moderate, the theory still applies as an
approximation. Sometimes the variate may be transformed so as to bring it nearer
to normality or the variances in the different classes nearer to equality.
In certain cases, by a process of randomisation before the data are collected,
it may be ensured that the z-test remains valid even where the parent is
not normal, though this amounts to a change in the nature of the inference.
These topics, however, are outside the scope of this book. 

The case of three independent variables 

22.27 The results appropriate to two independent variables may be 
extended. The general case of n independent variables is rather com- 
plicated and indeed data so completely specified for n greater than three 
are rare. We shall conclude this chapter by stating without proof the 
results for three independent variables, commenting on one or two new 
points, and giving an example. 

Consider then the case where there are three classifications into A-,
B- and C-classes, one member in each sub-class typified by $x_{ijk}$. With
an obvious generalisation of previous results we have (summation
extending over all i, j, k)

$$\begin{aligned}
\sum(x_{ijk}-x_{...})^2 = {} & \sum(x_{i..}-x_{...})^2 + \sum(x_{.j.}-x_{...})^2 + \sum(x_{..k}-x_{...})^2 \\
& + \sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2 + \sum(x_{.jk}-x_{.j.}-x_{..k}+x_{...})^2 \\
& + \sum(x_{i.k}-x_{i..}-x_{..k}+x_{...})^2 \\
& + \sum(x_{ijk}-x_{ij.}-x_{.jk}-x_{i.k}+x_{i..}+x_{.j.}+x_{..k}-x_{...})^2
\end{aligned} \tag{22.29}$$

the summations extending over all members of the sample, say pqr in
number, where there are p A-classes, q B-classes and r C-classes.

Each item on the right in (22.29) provides an estimate of the parent
variance on the hypothesis of homogeneity. The first three items are
of the type "between classes" which we have already encountered.
The next three are known as interaction terms. The last is a residual and
may also be regarded as an interaction of second order. We have then
an analysis in the following form.


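TABLE 22.13. Form of analysis of variance for three independent variables with one
member in each sub-class

Variation              Degrees of freedom     Sums of squares

Between A-classes      p-1                    $\sum(x_{i..}-x_{...})^2$
Between B-classes      q-1                    $\sum(x_{.j.}-x_{...})^2$
Between C-classes      r-1                    $\sum(x_{..k}-x_{...})^2$
Interaction AB         (p-1)(q-1)             $\sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2$
Interaction BC         (q-1)(r-1)             $\sum(x_{.jk}-x_{.j.}-x_{..k}+x_{...})^2$
Interaction CA         (r-1)(p-1)             $\sum(x_{i.k}-x_{i..}-x_{..k}+x_{...})^2$
Residual               (p-1)(q-1)(r-1)        $\sum(x_{ijk}-x_{ij.}-x_{.jk}-x_{i.k}+x_{i..}+x_{.j.}+x_{..k}-x_{...})^2$

Totals                 pqr-1                  $\sum(x_{ijk}-x_{...})^2$

Quotient (each row): the quotient of the sum of squares by the
corresponding number of degrees of freedom.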



22.28 As in 22.18 and 22.19, if the variate is regarded as the sum
of three class effects $a_i$, $b_j$ and $c_k$ and a normal residual, the residual
quotient continues to provide an estimate of the variance of that residual.
It is therefore customary to test the quotients between classes in relation to
the residual quotient.

We also have, however, three interaction quotients which, on the
hypothesis of homogeneity, should also be equal, within sampling limits,
to the residual quotient. If the AB interaction quotient is not equal,
within such limits, to the residual we must reject the hypothesis that the
variation can be expressed as the sum of the two class effects $a_i$ and $b_j$.
The class effects are, so to speak, entangled, or they "interact." Similarly
for the other two interactions.

Example 22.7. The following example typifies a situation of fairly
general occurrence but has been simplified somewhat to reduce the
arithmetic. Suppose we have two manurial treatments which we wish
to test. We will suppose that they are each applied to five varieties of a
cereal, and that, to give the experiment greater generality, it is repeated
at four different stations. Our 40 yields are then classified into a 4 × 5 × 2
grouping, four stations, five varieties and two treatments. We will
suppose that the yields, measured about some convenient working mean,
and expressed in some convenient unit, are as given in Table 22.14, wherein
T1 and T2 refer to the two treatments.

TABLE 22.14

                                    Varieties
Stations        1           2           3           4           5         Totals
             T1   T2     T1   T2     T1   T2     T1   T2     T1   T2     T1   T2

1            -6   -4     -4   -2    -10   -7     -3   -5      1    0    -22  -18
2            -2   -1     -5   -1     -3   -4     -4   -1     -2    1    -16   -6
3             3    2     -2    3     -4    0      4    1      3    3      4    9
4             3    6      3    2      6    3     -1    5      6    8     17   24

Totals       -2    3     -8    2    -11   -8     -4    0      8   12    -17    9


The sum of squares of the 40 values in the main body of the table will
be found to be 640. Thus we have

$$x_{...} = -8/40 = -0.20$$
$$40\,x_{...}^2 = 1.6$$
$$\sum(x_{ijk}-x_{...})^2 = 640 - 1.6 = 638.4 \tag{a}$$

Now we find the sum of squares between stations (S), varieties (V),
and treatments (T) separately. The yields for the four stations are the
totals of the two columns on the right in Table 22.14, namely, -40, -22,
13, 41. The sum of squares of these values is 3934. Now (the first
suffix referring to S)

$$\sum_{i,j,k}(x_{i..}-x_{...})^2 = \sum_{i,j,k}x_{i..}^2 - 40\,x_{...}^2 \tag{b}$$

In the column totals there are 5×2=10 members contributing to the
sum; but the summation on the right in (b) takes place over the four
stations and the 5×2=10 members for each station. Thus

$$\sum_{i,j,k}x_{i..}^2 = 10\sum_i x_{i..}^2 = 10\sum_i\left(\frac{y_i}{10}\right)^2 = \frac{1}{10}\sum_i y_i^2$$

where the y's are the totals.
Thus

$$\sum_{i,j,k}x_{i..}^2 = 393.4$$

$$\sum_{i,j,k}(x_{i..}-x_{...})^2 = 393.4 - 1.6 = 391.8 \tag{c}$$

and this gives the sum of squares between stations.

Generally, if we require the sum of squares between A-classes in a
p × q × r classification we have

$$\sum_{i,j,k}(x_{i..}-x_{...})^2 = \frac{1}{qr}\sum_i y_i^2 - pqr\,x_{...}^2$$

where $y_i$ is the total of the ith A-class.

The five totals of varieties are 1, -6, -19, -4, 20 with a sum of squares
equal to 814. Thus for the sum of squares between varieties we have

$$\frac{814}{8} - 1.6 = 100.15 \tag{d}$$

We leave the student to check as an exercise that the sum of squares
between treatments is 16.9.   (e)

Now we have to find the interaction terms. For this purpose it is 
most convenient to condense the primary Table 22.14 into three others, 
of which we will write down one. If we add the yields for the two treat- 
ments on any particular variety and station, we obtain the following — 


                          Varieties
Stations        1      2      3      4      5      Totals

1             -10     -6    -17     -8      1       -40
2              -3     -6     -7     -5     -1       -22
3               5      1     -4      5      6        13
4               9      5      9      4     14        41

Totals          1     -6    -19     -4     20        -8

The sum of squares of values in the main body of the table will be found
to be 1112. Each entry is the sum of two values and, with an obvious
extension of previous results we have

$$\sum(x_{ij.}-x_{...})^2 = \frac{1112}{2} - 1.6 = 554.4 \tag{f}$$

Now for the interaction SV we have

$$\sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2 = \sum(x_{ij.}-x_{...})^2 - \sum(x_{i..}-x_{...})^2 - \sum(x_{.j.}-x_{...})^2$$

Substituting from (f), (c) and (d) we have on the right

$$554.4 - 391.8 - 100.15 = 62.45 \tag{g}$$

which is the required interaction sum of squares for SV.

Again we leave the student to calculate the other two interactions to
obtain that for VT as 3.85 and that for TS as 2.10. We have finally
(the residual sum of squares being calculated by subtracting the sum of
the other terms from the total deviance):

TABLE 22.15. Analysis of variance of Table 22.14

Variation                     Degrees of     Sum of       Quotient
                              freedom        squares

Between stations (S)            3            391.80       130.60
Between varieties (V)           4            100.15        25.04
Between treatments (T)          1             16.90        16.90
Interaction SV                 12             62.45         5.20
Interaction VT                  4              3.85         0.96
Interaction ST                  3              2.10         0.70
Residual                       12             61.15         5.10

Total                          39            638.40

Now we first of all test our interactions against the residual term with
a quotient of 5.10 and 12 degrees of freedom. We find in fact that
they are not significant at the 5 per cent level. This implies that we may
assume that there is no “ entanglement ” between the factors and that 
there is support for the hypothesis that the three are affecting yields 
independently. We can then turn to a consideration of the main effects. 

We find that the differences between stations are highly significant, 
those between varieties are not significant at a 1 per cent level but are 
so at a 5 per cent level, and that differences between treatments are not 
significant. We conclude that the variation in yields is due to variation 
between stations and (perhaps) between varieties, but cannot be ascribed 
to real differential effects between treatments without further inquiry. 
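The whole of the arithmetic in Example 22.7 can be retraced mechanically. The following Python sketch rebuilds the sums of squares of Table 22.15 from the yields of Table 22.14, using the same device as the text: square the class totals, divide by the number of members in each total, and subtract the correction 1.6. The helper name `ss_of_totals` and the data layout are illustrative assumptions, not part of the original.

```python
# yields[s][v] = (T1, T2) for station s and variety v, copied from Table 22.14
yields = [
    [(-6, -4), (-4, -2), (-10, -7), (-3, -5), (1, 0)],
    [(-2, -1), (-5, -1), (-3, -4), (-4, -1), (-2, 1)],
    [(3, 2), (-2, 3), (-4, 0), (4, 1), (3, 3)],
    [(3, 6), (3, 2), (6, 3), (-1, 5), (6, 8)],
]
p, q, r = 4, 5, 2                     # stations, varieties, treatments
cells = [(s, v, t, yields[s][v][t])
         for s in range(p) for v in range(q) for t in range(r)]

grand_total = sum(x for _, _, _, x in cells)
correction = grand_total ** 2 / (p * q * r)      # 40 times the squared mean

def ss_of_totals(key, members):
    """Squared class totals over members per total, less the correction."""
    totals = {}
    for s, v, t, x in cells:
        k = key(s, v, t)
        totals[k] = totals.get(k, 0) + x
    return sum(y * y for y in totals.values()) / members - correction

total = sum(x * x for _, _, _, x in cells) - correction
S = ss_of_totals(lambda s, v, t: s, q * r)
V = ss_of_totals(lambda s, v, t: v, p * r)
T = ss_of_totals(lambda s, v, t: t, p * q)
SV = ss_of_totals(lambda s, v, t: (s, v), r) - S - V
VT = ss_of_totals(lambda s, v, t: (v, t), p) - V - T
ST = ss_of_totals(lambda s, v, t: (s, t), q) - S - T
residual = total - (S + V + T + SV + VT + ST)

rows = [("S", p - 1, S), ("V", q - 1, V), ("T", r - 1, T),
        ("SV", (p - 1) * (q - 1), SV), ("VT", (q - 1) * (r - 1), VT),
        ("ST", (p - 1) * (r - 1), ST),
        ("Residual", (p - 1) * (q - 1) * (r - 1), residual)]
for name, df, ss in rows:
    print(f"{name:9s}{df:4d}{ss:10.2f}{ss / df:10.2f}")
```

The significance tests themselves still require tables of z or of the variance-ratio; the sketch stops at the quotients.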


SUMMARY 

1. The analysis of variance is essentially a procedure for testing the 
differences between different groups of data for homogeneity. 

2. For a single independent variable (classification into groups according 
to one quality) an analysis may be carried out to show estimates of the 
variance between and within classes whether the class-numbers are equal 
or not. Homogeneity may be tested by comparing the estimates. 

3. For small samples and normal parent variation the ratio of between-
and within-class variance may be tested in Fisher's z-distribution.

4. For classification according to more than one quality a more elaborate
form of analysis may be employed. The method applies only when the
numbers in sub-classes are equal (or more generally, proportionate) but
is probably a fair approximation when they are near equality.

5. The exact test of significance does not apply to non-normal variation 
except as an approximation, but where departure from normality is not 
great, the approximation is probably fair. 

6. The analysis of variance provides exact tests of significance (in the 
case of normal variation) for the correlation ratio, departure from linearity 
of regression, and the multiple correlation coefficient. 


EXERCISES 

22.1 The following shows the lives in hours of four batches of electric 
lamps — 

Batch 1 : 1600, 1610, 1650, 1680, 1700, 1720, 1800 

Batch 2 : 1580, 1640, 1640, 1700, 1750 

Batch 3 : 1460, 1550, 1600, 1620, 1640, 1660, 1740, 1820 

Batch 4 : 1510, 1520, 1530, 1570, 1600, 1680.

Perform an analysis of variance on these data and show that a significance 
test does not reject their homogeneity. 

22.2 Considering two samples as two families of values, derive an explicit 
form for the ratio of estimated variances between and within families 
and hence derive the t-test for the difference of means in normal samples
with equal variances as given in 21.21. (The distribution of the
variance-ratio for $\nu_1 = 1$ reduces to that of $t^2$.)

22.3 Four experimenters determine the moisture content of samples of
a powder, each man taking a sample from each of six consignments. Their
assessments are:

                         Consignment
Observer       1      2      3      4      5      6

1              9     10      9     10     11     11
2             12     11      9     11     10     10
3             11     10     10     12     11     10
4             12     13     11     14     12     10

Perform an analysis of variance on these data and discuss whether there
is any significant difference between consignments or between observers.

22.4 Verify the arithmetic and the significance tests of Example 22.7.

22.5 Test the significance of the two multiple correlation coefficients of
Example 12.3, page 299, other than the one tested in Example 22.6.

22.6 Test the linearity of the regression of the distribution of cows of 
Table 9.4, page 204 (referring to Exercise 13.1). 

22.7 Examine how, in the analysis of variance, sums of squares between 
classes may be regarded as interactions of zero order and (in the case 
of three independent variables) the residual may be regarded as an 
interaction of the second order. 

22.8 (Data from Mahalanobis, J. R. Statist. Soc., 1946, 109, 325). The
following table shows estimates of an index of the cost of living in an
area of Bengal in 1945 made by five investigators each working in each of
five areas.

                            Area
Investigator       1      2      3      4      5

1                270    263    264    263    260
2                280    265    274    274    279
3                275    284    278    271    296
4                271    269    272    297    274
5                279    267    269    263    284

Perform an analysis of variance to see whether there are significant
differences between areas and between investigators.

SOME PROBLEMS OF PRACTICAL SAMPLING


23.1 In the previous seven chapters we have discussed the interpretation 
of samples and developed various branches of theory which are designed 
to give precision, in the sense of the theory of probability, to inferences 
drawn from the sample to the population. At the outset (Chapter 16)
we considered briefly the types of sampling to which our theory is 
applicable, noting in particular the fundamental importance of randomness 
in the selection of data. We shall now examine in more detail some of the 
problems arising in the selection of samples to which our theory may apply. 

23.2 The complete process of sampling consists in effect of three stages, 
there being considerable scope for judgment at each stage. 

(1) If there is no natural unit, and often even if there is, we have to 
decide what shall be our unit for the purposes of sampling. If our problem 
is, for example, to determine the mean yield per acre of a certain crop 
over a certain large area, there is no natural unit of area over which the 
yield can be measured at each of n points in the large area. We must 
therefore fall back on practical considerations to decide whether our 
sampling unit shall be something very small, say a square yard, something 
a good deal bigger, say 1 /10th acre, or something larger still, such as an 
acre or more. If, on the other hand, the problem is to estimate by way 
of sampling the proportion of a certain human population possessing a 
certain characteristic, such as blue eyes, or surname beginning with H, 
or age under 21, the natural unit is the person ; but this, as we shall see 
presently, is not necessarily the most convenient unit for sampling 
purposes. 

(2) The unit having been fixed, the next step is to decide what shall be 
the process of sampling : if it is agreed that the process should be a 
random one, how is this randomness best secured ? If it appears possible 
that some departure from unrestricted random sampling may lessen the 
cost, or may even lower the standard error of estimation, what then shall 
be the procedure and will this procedure carry with it any countervailing 
risks? How are we to treat the cases in which a member that we intended 
to include cannot be found or, if found, will not provide a reply ? 

(3) The sample having been taken, i.e. the specific units to be included 
in the sample having been determined, the final stage of the work is the 
measurement, description, or (to use the term in a very general sense)
what we may call the examination of the units included in the sample.
Properly speaking, this is no part at all of the sampling process in the 
narrower sense ; that was completed when we had determined which 
specific members of the population were to be included in the sample. 
Examination of the units is a process of observation such as we would 
have had to carry out even if we had decided to deal with the entire 
population and not a mere sample. But it is a process fundamental to 
our work and must be considered here, for careless or incompetent
"examination" may lead to the most serious, and sometimes astonishing,
errors.

We will consider these three stages in the order given, as this will couple 
the work of the present chapter most closely and logically with that of 
the preceding chapters. 

Size of the sampling unit 

Example 23.1. Effect of size of unit on bias
We take, first of all, an example illustrating the importance of the
sampling unit in some types of inquiry. In an investigation into the
yield of jute in Bengal in 1940-41 (Mahalanobis, J. Roy. Stat. Soc., 1946,
109, 325) material was collected for five different sizes of sample-cut from
the fields, ranging from one square foot to 256 square feet. In each
field (which was selected at random) an area of 16 × 16 feet was chosen,
also at random, and the crop was harvested in a number of sub-cuts
supplying yield rates for the sizes: 1 × 1, 3 × 3, 12 × 4, 12 × 12 and 16 × 16
feet, the latter being the whole plot. The following are the estimates
of the yield in lb. per acre based on the various plot sizes:

Size (ft.)        Estimated yield (lb. per acre)

 1 × 1                   27,271
 3 × 3                   17,462
12 × 4                   16,080
12 × 12                  16,763
16 × 16                  16,828

Evidently the estimates based on the two smallest sizes of plot are
seriously biased. In this particular case it was easily shown that the
differences could not have been sampling effects.

The reason for this effect is not yet beyond doubt, but apparently it
is due to unconscious bias on the part of the observer, who, in measuring
out the plot, has a tendency to include rather than to exclude plants on
land near the boundary. This effect naturally diminishes in proportion
as the plot becomes larger. The remedy in this case is clear; it is simply
not to use plots which are too small.

23.3 For all practical purposes the case we have just considered may be 
regarded as one in which the area covered is continuous, so that there 
is no "unit" indicated by the nature of the data. We could, it is true,
regard the individual plant as the ultimate unit; but for practical reasons
we cannot, in an extensive inquiry, bother ourselves with the selection 
of plants. We must select fairly large areas, and the question then 
arises how the size of those areas is to be determined. In Example 23.1 
the bias appearing for very small areas dictated a lower limit to the proper 
size but did not suggest an upper limit. 

23.4 Even for discontinuous units the same type of question can arise. 
Suppose, for example, we are sampling a country for the purpose of 
determining the size of population or some similar demographic character- 
istic such as would be given by a census. The ultimate unit is the indi- 
vidual human being, but it may be very troublesome to pick out individuals 
at random. Shall we lose anything by sampling with families as units, 
or houses, or streets, or blocks or even whole wards ? Again, in an 
agricultural inquiry, do we lose anything by taking as our unit the farm 
instead of the individual field ? 

23.5 Such questions rarely admit of a simple answer. In general there 
will be a group of considerations in favour of choosing as large a unit as 
possible and another group in favour of choosing a small one. Among 
those of the first kind we may mention economy (e.g. because less time 
and travelling are involved if the individuals are grouped and have to 
be visited, or because information has already been tabulated for the 
larger units). Among those of the second are the desirability of not 
clustering sample-members too closely when the population is thought 
to be "patchy". Additional complications may arise when our "units"
are of different sizes, such as farms, for then there is some intuitive ground
for feeling that the different units ought to be given varying weights.
When the sizes of the units are known we can sometimes deal with the
problem as one of stratification, which we consider below, but there are 
some rather complicated points arising in this branch of the subject 
which have not yet been completely solved. 

Some sampling procedures

23.6 We shall now consider some sampling procedures which depend
for their efficacy on prior knowledge of the population. When nothing
is known about the population a purely random selection of members
is the best. It avoids bias and can be made to provide information about
the standard errors of the quantities under estimate. Only rarely,
however, do we embark on an inquiry in complete ignorance about the
parent population. Our knowledge may be only vague and general, but
even so we can often apply it to improve the precision of our estimates.
Moreover, it is often highly inconvenient and expensive to draw a purely
random sample from a large existent population (e.g. by the use of random
sampling numbers) and practical necessity may dictate a modification of
the random process even though no theoretical gain in accuracy or
precision may result.

Stratified sampling

23.7 We referred briefly in 16.39 to the process of stratification, in which
we divide the population into strata and draw a random sample of
specified size from each stratum. Sometimes our stratification may be
on a purely geographical basis, as for example if, in sampling farms from
England, we decide to draw a certain proportion from each individual
county. Sometimes it may be by reference to a variate-value, as when
we decide to draw certain numbers of farms in certain size groups
irrespective of their geographical position. The operation of stratification may be
undertaken either to improve the value of an estimate or merely for
administrative convenience. If the strata are determined by some
"natural" factor the sampling process by stratification will also facilitate
comparison of the strata among themselves, which may be a subsidiary
object of the inquiry.

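Sampling fractions

23.8 Suppose we have a population stratified into k strata, the number
in the ith stratum being $N_i$ and the total number being N. We
take a sample of n members such that the number chosen from the ith
stratum is $n_i$. Suppose that we desire to estimate the mean value $\alpha$ of
a variate x in the whole population. How shall we choose the numbers $n_i$?

We shall assume that if $x_{ij}$ is the jth member of the sample of $n_i$, the
estimate is of the form

$$t = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(\lambda_{ij}x_{ij}) \tag{23.1}$$

where the $\lambda$'s are constants to be determined. This assumption may be
expressed by saying that we are looking for a linear estimate. Among
all the possible estimating functions of this kind we shall seek the one
which has the smallest variance. There are obvious advantages in an
estimate with the minimum of sampling fluctuation.

If the mean value of $x_{ij}$ in the ith stratum is $\alpha_i$ we have

$$\alpha = \frac{1}{N}\sum_{i=1}^{k}N_i\alpha_i \tag{23.2}$$

Thus, writing E to denote the taking of a mean value we have, from
(23.1) and (23.2), if the mean value of t is to be $\alpha$,

$$E\left\{\sum_{i,j}(\lambda_{ij}x_{ij})\right\} = \frac{1}{N}\sum_{i=1}^{k}N_i\alpha_i \tag{23.3}$$

and since, by definition, $E(x_{ij}) = \alpha_i$ we have

$$\sum_{i=1}^{k}\alpha_i\left\{\sum_{j=1}^{n_i}\lambda_{ij} - \frac{N_i}{N}\right\} = 0 \tag{23.4}$$

If this is to be generally true independently of particular values of $\alpha_i$
we must have

$$\sum_{j=1}^{n_i}\lambda_{ij} = \frac{N_i}{N} \tag{23.5}$$

This provides a first condition on the $\lambda$'s in order that the estimate may
have the true value as its mean value, that is, that it should be unbiased in a
sense we define in 23.17. If $\lambda_{i.}$ is the mean of $\lambda_{ij}$ in the ith set we may
write this as

$$n_i\lambda_{i.} = \frac{N_i}{N} \tag{23.6}$$

Now consider the condition that the variance of t shall be a minimum.
Since E denotes a mean value we have for the ith stratum

$$\operatorname{var}\sum_{j=1}^{n_i}(\lambda_{ij}x_{ij}) = E\left[\sum_{j}\{\lambda_{ij}(x_{ij}-\alpha_i)\}\right]^2$$

This is equal to

$$E\left[\sum_{j}\lambda_{ij}^2(x_{ij}-\alpha_i)^2 + \sum_{j,l}{}'\,\lambda_{ij}\lambda_{il}(x_{ij}-\alpha_i)(x_{il}-\alpha_i)\right]$$

where $\sum'$ denotes summation over values of j and l except those for which
j = l. If the variance of x in the ith stratum is $\sigma_i^2$ this is equal to

$$\sigma_i^2\sum_{j}\lambda_{ij}^2 + \sum_{j,l}{}'\,\lambda_{ij}\lambda_{il}\,E\{(x_{ij}-\alpha_i)(x_{il}-\alpha_i)\} \tag{23.7}$$

Now since there are $N_i(N_i-1)$ values for which $j \neq l$,

$$E\{(x_{ij}-\alpha_i)(x_{il}-\alpha_i)\} = \frac{1}{N_i(N_i-1)}\left[\left\{\sum(x-\alpha_i)\right\}^2 - \sum(x-\alpha_i)^2\right] = -\frac{\sigma_i^2}{N_i-1} \tag{23.8}$$

the sums in the bracket extending over all $N_i$ members of the stratum.

From (23.7) and (23.8) we then have

$$\begin{aligned}
\operatorname{var}\sum_{j}(\lambda_{ij}x_{ij}) &= \sigma_i^2\sum_{j}\lambda_{ij}^2 - \frac{\sigma_i^2}{N_i-1}\sum_{j,l}{}'\,\lambda_{ij}\lambda_{il} \\
&= \frac{\sigma_i^2}{N_i-1}\left\{N_i\sum_{j}\lambda_{ij}^2 - n_i^2\lambda_{i.}^2\right\} \\
&= \frac{\sigma_i^2}{N_i-1}\left\{N_i\sum_{j}(\lambda_{ij}-\lambda_{i.})^2 + n_i(N_i-n_i)\lambda_{i.}^2\right\}
\end{aligned} \tag{23.9}$$

Now t is the sum of k items, each of which comes from a different stratum
and is therefore independent of the others. Consequently the variance
of t is the sum of the constituent variances, i.e. is the sum over i of the
expression on the right in (23.9). This is clearly a minimum if, for all i and j,

$$\lambda_{ij} - \lambda_{i.} = 0 \tag{23.10}$$

This is equivalent to saying that within any substratum the $\lambda$'s must be
equal, which is what we should expect, for there is no reason why one
should be greater than another.

We then have, using (23.6),

$$\begin{aligned}
\operatorname{var} t &= \sum_{i}\frac{\sigma_i^2\,n_i(N_i-n_i)}{N_i-1}\lambda_{i.}^2 \\
&= \frac{1}{N^2}\sum_{i}\frac{N_i^2\sigma_i^2}{N_i-1}\cdot\frac{N_i-n_i}{n_i} \\
&= \frac{1}{N^2}\sum_{i}\frac{N_i^3\sigma_i^2}{(N_i-1)\,n_i} - \text{constant}
\end{aligned} \tag{23.11}$$

We have to minimise this for variations in $n_i$ subject to

$$\sum_{i} n_i = n = \text{constant} \tag{23.12}$$

It may easily be shown by the use of differential calculus that the minimal
values of $n_i$ are given by

$$n_i \propto \sqrt{\frac{N_i^3}{N_i-1}}\;\sigma_i \tag{23.13}$$

If now $N_i$ is large we have approximately

$$n_i \propto N_i\sigma_i \tag{23.14}$$

or

$$\frac{n_i}{N_i} \propto \sigma_i \tag{23.15}$$

Thus the ratio which the sample-number $n_i$ bears to the stratum number
$N_i$ varies as the standard deviation of the stratum.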
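The allocation rule of (23.13) to (23.15) can be sketched in Python as follows. The function name `optimal_allocation`, the flag `large_strata` and the three strata used in the illustration are hypothetical; they are not data from the text. The sketch does not guard against an allocation exceeding the size of a stratum.

```python
import math

def optimal_allocation(stratum_sizes, stratum_sds, n, large_strata=True):
    """Split a total sample of n among strata so that var t is a minimum.

    With large_strata=True, n_i is proportional to N_i * sigma_i as in (23.14);
    otherwise the factor sqrt(N_i / (N_i - 1)) of (23.13) is retained.
    """
    weights = []
    for N_i, sigma_i in zip(stratum_sizes, stratum_sds):
        w = N_i * sigma_i
        if not large_strata:
            w *= math.sqrt(N_i / (N_i - 1))
        weights.append(w)
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata, for illustration only.
sizes = [1000, 2000, 3000]
sds = [1.0, 2.0, 4.0]
allocation = optimal_allocation(sizes, sds, n=600)
fractions = [n_i / N_i for n_i, N_i in zip(allocation, sizes)]
print([round(n_i, 1) for n_i in allocation])
print([round(f, 4) for f in fractions])
```

The sampling fractions come out in the ratio of the standard deviations, as (23.15) requires.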

23.9 This interesting result has some important applications in stratified
sampling. We need not consider the case in which the $\sigma$'s are known
exactly (for we should rarely have this knowledge without knowing the
means, in which case we should not be estimating the mean of the whole
from the sample). There remain, however, two classes of case where the
result is useful; when:

(a) The standard deviations are known approximately from prior
information. In such a case we can determine the n's from (23.15) to
some degree of approximation. An estimate based on a sample obtained
in this way, though not perhaps as good as it might be, will at least be
better than if we had ignored our knowledge of the standard deviations.

(b) A pilot inquiry on a small scale can be conducted to determine the
standard deviations approximately. This will bring us back to case (a).

Example 23.2. (Data from Yates, J. Roy. Stat. Soc., 1946, 109, 12.)

The Farm Survey of England and Wales covered all holdings of five
acres or more. Prior information was available as to the size-distribution
of these holdings as follows:

Size group (acres)               Number of holdings

  5 and less than  25                 101,450
 25  "    "    "  100                 111,360
100  "    "    "  300                  65,210
300  "    "    "  700                  11,150
700 and over                            1,430

Total                                 290,600

We wish to take a sample, say, of about one in seven, or about 40,000 
holdings, in order to estimate some factor for the population of farms 
such as the arable acreage. What fractions of the various size groups 
should we choose ? 

If we have, in the general case, a sample number $n_i$ in the ith stratum,
where $\sum(n_i) = n$, we shall take as our estimator of the mean of the whole
population the statistic

$$\bar{x} = \frac{1}{N}\sum_{i}N_i\bar{x}_i$$

where $\bar{x}_i$ denotes the mean of the sample values from the ith stratum.
This is an unbiased estimator in the sense of 23.17, for the mean value
of $\bar{x}_i$ is the same as the mean value of $x_{ij}$ over the ith stratum, i.e. is $\alpha_i$.

Furthermore, the variance of $\bar{x}$ will be given by

$$\operatorname{var}\bar{x} = \sum_{i}\left(\frac{N_i}{N}\right)^2\operatorname{var}\bar{x}_i
= \sum_{i}\left(\frac{N_i}{N}\right)^2\frac{N_i-n_i}{N_i}\cdot\frac{\sigma_i^2}{n_i}$$

approximately.

The reader may verify as an exercise that when $n_i$ is as given
by (23.15) this reduces to the minimal variance given by (23.11) to our
degree of approximation, which is reached by writing $N_i$ instead of $N_i-1$
in the denominator.

We do not know the standard deviations of the factor under investiga- 
tion in the various strata but we may make some very plausible assump- 
tions. There must clearly be some high correlation between arable 
acreage and farm area. Let us then suppose that the variability of the 
one is proportional to that of the other, i.e. that our sampling fraction can 
be taken as proportional to the standard deviations of size of farm. A 
sketch of the histogram of the data will show that the distribution is 
approximately J-shaped. If in any stratum the farms were distributed 
equally frequently with respect to size (i.e. if the histogram were actually 
the frequency distribution) the variance of a stratum of width h would be
$h^2/12$ and hence its standard deviation would be proportional to h. Let
us then choose our sampling fractions proportional to the widths of the
size groups.

The last group, 700 acres and over, has an unspecified upper limit. We
will, therefore, suppose the standard deviation very large and sample
100 per cent. The ranges of the other groups are 20, 75, 200 and 400
acres and thus our fractions are proportional to these numbers, say 20x,
75x, etc. We then have

(20x)(101,450) + (75x)(111,360) + (200x)(65,210)
        + (400x)(11,150) = 39,000, say, giving
x = 0.00140

The fractions are then approximately 2.8, 10.5, 28 and 56 per cent.
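The value of x just obtained can be checked with a short Python sketch. The variable names are illustrative; the figures are those of the example, with 39,000 as the round number of holdings to be drawn from the four lower size groups.

```python
# Holdings and class widths (acres) of the four size groups below 700 acres.
holdings = [101_450, 111_360, 65_210, 11_150]
widths = [20, 75, 200, 400]
target = 39_000          # holdings to be drawn from these four groups, as in the text

# Sampling fractions are taken proportional to the widths: fraction_i = widths_i * x.
x = target / sum(h * N_i for h, N_i in zip(widths, holdings))
fractions = [h * x for h in widths]

print(round(x, 5))
print([round(100 * f, 1) for f in fractions])   # per cent
```

Carrying x unrounded gives fractions that agree with those quoted to within about a tenth of one per cent, a difference which is immaterial here.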

The figures used in actual practice (though not obtained by this method)
were 5, 10, 25, 50. As we shall see below, extreme precision in the
sampling proportions is unnecessary. It was recognised that the smaller
farms were over-represented, this being a deliberate modification
introduced for other purposes.

We may form an idea of the relative efficiency of this method of sampling
as compared with others which might suggest themselves. With sampling
fractions 5, 10, 25, 50 and 100 per cent we have

   $N_i$         $100\,n_i/N_i$      Variance               $n_i$ (Farms
                                     (proportional to)      sampled)

 101,450             5               $20^2$                   5,072
 111,360            10               $75^2$                  11,136
  65,210            25               $200^2$                 16,302
  11,150            50               $400^2$                  5,575
   1,430           100               (not required)           1,430

Totals 290,600                                               39,515

Now from our expression for the variance of $\bar{x}$ we have

$$\operatorname{var}\bar{x} = \frac{1}{N^2}\sum_{i}N_i\sigma_i^2\left(\frac{N_i}{n_i}-1\right)$$

We may now calculate this quantity, or rather a quantity proportional
to it (since we are assuming the variances proportional to the squares of
the widths of the grouping intervals). For instance the first term in the
summation is

101,450 × 400 × (20 - 1).

We find that var t is proportional to 0.1896. We do not require the
variance of the last interval because the factor $(N_i/n_i - 1)$ vanishes for it.

It is also of interest to see what happens if we draw the same proportion
from each of the five strata, a procedure which has a certain prior
plausibility. The total sample number is 39,515/290,600 = 13.598 per
cent. We shall now require an estimate of the variance in the last class
of farm of 700 acres and over, and shall take it to be proportional to $400^2$.
Denoting the sampling proportion by p we have, for an estimate of the
mean w based on this method,

$$\operatorname{var} w = \frac{1}{N^2}\left(\frac{1}{p}-1\right)\sum_{i}N_i\sigma_i^2$$

This formula gives us var w proportional to 0.3979, i.e. a variance more
than twice as great as that obtained by the first method.
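The comparison of the two methods can be reproduced with the following Python sketch. It assumes, as the text does, that the stratum variances are proportional to the squares of the class widths, and that the last class has a width of 400; the function name `relative_variance` is illustrative.

```python
holdings = [101_450, 111_360, 65_210, 11_150, 1_430]
# Class widths, taken as proportional to the stratum standard deviations;
# the 400 for "700 acres and over" is the assumption made in the text.
widths = [20, 75, 200, 400, 400]
N = sum(holdings)

def relative_variance(fractions):
    """(1/N^2) * sum of N_i * h_i^2 * (N_i/n_i - 1), with n_i/N_i given by fractions."""
    return sum(N_i * h ** 2 * (1.0 / f - 1.0)
               for N_i, h, f in zip(holdings, widths, fractions)) / N ** 2

# Fractions actually used in the survey: 5, 10, 25, 50 and 100 per cent.
var_survey = relative_variance([0.05, 0.10, 0.25, 0.50, 1.00])

# The same total sample, 39,515 holdings, drawn in equal proportion from every stratum.
p = 39_515 / N
var_uniform = relative_variance([p] * 5)

print(round(var_survey, 3), round(var_uniform, 3), round(var_uniform / var_survey, 2))
```

The two results agree with the 0.1896 and 0.3979 of the text to three decimal places, and their ratio is a little over two.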

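23.10 From the determination of the "best" sampling fractions by
minimising the variance it follows that fractions near to the optimum
will give almost as good results as the best. We may establish the result
directly as follows. Let $p_i = n_i/N_i$.
Then

$$\operatorname{var} t = \frac{1}{N^2}\sum_{i}N_i\sigma_i^2\left(\frac{1}{p_i}-1\right) \tag{23.16}$$

Now suppose that instead of the optimum proportions $p_i$ we choose
proportions $p_i+\delta_i$ where the $\delta$'s are small, so that their squares may be
neglected. Since the sample number is the same in both cases we have

$$\sum(N_ip_i) = \sum\{N_i(p_i+\delta_i)\}$$

giving

$$\sum(N_i\delta_i) = 0 \tag{23.17}$$

If u is the alternative estimate

$$\operatorname{var} u = \frac{1}{N^2}\sum_{i}N_i\sigma_i^2\left(\frac{1}{p_i+\delta_i}-1\right)$$

and since, to our approximation,

$$\frac{1}{p_i+\delta_i} = \frac{1}{p_i}\left(1-\frac{\delta_i}{p_i}\right)$$

we have

$$\operatorname{var} u = \operatorname{var} t - \frac{1}{N^2}\sum_{i}\frac{N_i\sigma_i^2\delta_i}{p_i^2}$$

Now $p_i$ is equal to $a\sigma_i$ where a is a constant and consequently the second
term vanishes in virtue of (23.17). Thus var u is practically the same
as var t.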

The effect of this result is that we need not be too meticulous in deter- 
mining our sampling fractions. Any values near the optimum will give 
a sampling variance very near the minimum. 
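A small numerical check may make the point concrete. The Python sketch below uses three hypothetical strata (the sizes, standard deviations and sample number are invented for illustration). It takes the optimum fractions as proportional to the stratum standard deviations, and perturbs them in such a way that the total sample number is unchanged.

```python
def variance_of_mean(sizes, fracs, sds):
    """Return (1/N^2) * sum of N_i * sigma_i^2 * (1/p_i - 1)."""
    total = sum(n * s**2 * (1.0 / p - 1.0) for n, p, s in zip(sizes, fracs, sds))
    return total / sum(sizes) ** 2


# Hypothetical strata, not taken from the text.
N = [1_000, 2_000, 3_000]
sigma = [10.0, 20.0, 40.0]
n_total = 600

# Optimum fractions p_i = a * sigma_i, with a fixed by the total sample number.
a = n_total / sum(n * s for n, s in zip(N, sigma))
p_opt = [a * s for s in sigma]

# Perturbations chosen so that sum(N_i * delta_i) = 0, i.e. the total
# sample number stays at 600.
delta = [0.01, 0.01, -0.01]
p_alt = [p + d for p, d in zip(p_opt, delta)]

var_opt = variance_of_mean(N, p_opt, sigma)
var_alt = variance_of_mean(N, p_alt, sigma)
relative_increase = (var_alt - var_opt) / var_opt

print(f"optimum fractions:   {[round(p, 4) for p in p_opt]}")
print(f"perturbed fractions: {[round(p, 4) for p in p_alt]}")
print(f"relative increase in variance: {relative_increase:.2%}")
```

In this invented case the smallest fraction is altered by more than a quarter of its value, yet the variance should rise by less than two per cent.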


23.11 Various elaborations of ordinary sampling or stratified sampling 
are possible and are sometimes employed. For example, we may sample 
in two stages, the second sample being a sub-sample of the members of 
the first sample ; and the method may be extended to further sub- 
samples. Suppose, for instance, that we require a comparatively small 
sample from the inhabitants of a certain country. For administrative 
reasons it may be more convenient to draw first of all a primary sample, 
consisting of towns and rural districts ; then, from each member of the 
sample, a number of houses ; and then, say, one member from each house. 
At some stage in the process, e.g. in the selection of houses, we might
have stratified. There is evidently a very large number of possible
combinations of different techniques in general, although in practice a
limit is often imposed by cost or convenience.
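The three-stage procedure just described (districts, then houses, then one member of each house) can be sketched in Python as follows. The sampling frame, the names in it and the function `three_stage_sample` are hypothetical and serve only to show the nesting of the stages.

```python
import random

# Hypothetical frame: district -> house -> members (all names are made up).
frame = {
    "Town A": {"house 1": ["a1", "a2"], "house 2": ["a3"], "house 3": ["a4", "a5", "a6"]},
    "Town B": {"house 1": ["b1"], "house 2": ["b2", "b3"], "house 3": ["b4", "b5"]},
    "Rural district C": {"house 1": ["c1", "c2", "c3"], "house 2": ["c4"]},
}


def three_stage_sample(frame, n_districts, n_houses, rng):
    """Draw districts, then houses within each district, then one member per house."""
    chosen = []
    for district in rng.sample(sorted(frame), n_districts):
        houses = frame[district]
        for house in rng.sample(sorted(houses), min(n_houses, len(houses))):
            member = rng.choice(houses[house])
            chosen.append((district, house, member))
    return chosen


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sample = three_stage_sample(frame, n_districts=2, n_houses=2, rng=rng)
for row in sample:
    print(row)
```

Stratification at any stage would replace the corresponding `rng.sample` call by a separate draw within each stratum.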

23.12 The student will inquire whether there is any advantage in these 
more complicated procedures from the theoretical viewpoint; whether,
for example, it is possible to reduce the sampling error by sub-sampling. 
We shall only have the space for a brief discussion of this question. 

If all the sampling is random and the population is homogeneous there 
is no theoretical advantage in sub-sampling. An ordinary random 
sampling process gives each member of the population the same chance 
of being chosen. If we choose groups at random, and the members of
those groups may be regarded as having been allotted at random to the groups,
the more complicated technique also gives each member the same chance
of being chosen, and the methods are equivalent. 

23.13 In practice, however, the nature of the grouping is often known 
to be such that the members cannot be regarded as grouped at random, 
and the effect of stratification or sub-sampling may be to alter the
standard errors of estimation quite considerably. To take our former 
example of sampling from a human population ; there may be (and 
usually there is) a good prior reason to expect, that the quantity we are 
investigating differs between town and country districts, so that the 
population is patchy and, in any given area, there is a positive correlation 
between contiguous members of a sample ; or again, if we take only one 
member from a household we may exclude from occurrence certain 
coincidences or resemblances which are more likely to occur within a 
household than between households. This patchiness in the population 
may, or may not, be an advantage in reducing the standard error. There 
do not appear to be any very general rules on the subject and a great 
deal depends on the nature of the patchiness. It is nevertheless possible 
to make certain assumptions about certain types of population with
great confidence, and to base sampling techniques on them. 

Example 23.3. A survey is carried out in a particular town. Certain
households are chosen at random and then one member from each
household. Suppose the quantity under consideration is some continuous
variate x.

Let us suppose that the maximum number of members in a family is k,
that there are $F_1$ families with one member, $F_2$ with two members and so
on. The total number of families we may write as F and the total number
of individuals as N. Then we have

$$\sum_{j=1}^{k} F_j = F \qquad (23.18)$$

$$\sum_{j=1}^{k} j F_j = N \qquad (23.19)$$

Let the mean and variance of x in the l-th family of the set of $F_j$ families
be $m_{jl}$ and $v_{jl}$ respectively. Then if m and v are the mean and variance
of the total population of individuals

$$N m = \sum_{j=1}^{k} \sum_{l=1}^{F_j} j\, m_{jl} \qquad (23.20)$$

$$N (v + m^2) = \sum_{j=1}^{k} \sum_{l=1}^{F_j} j\, (v_{jl} + m_{jl}^2)$$

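For an unrestricted random sample of n (small compared with N) from
the whole population the variance of the mean is v/n, say $v_a$, so that we
have

$$v_a = \frac{1}{n} \left[ \frac{1}{N} \sum_{j} \sum_{l} j\, (v_{jl} + m_{jl}^2) - \left( \frac{1}{N} \sum_{j} \sum_{l} j\, m_{jl} \right)^2 \right] \qquad (23.21)$$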

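Now suppose we take a random sample of n households and choose
one member from each household. In such a case we are sampling from
a population of F members, one from each family. The variance of
such a population is given by V, say, where

$$V + \left( \frac{1}{F} \sum_{j} \sum_{l} m_{jl} \right)^2 = \frac{1}{F} \sum_{j} \sum_{l} (v_{jl} + m_{jl}^2)$$

and hence the variance of the mean of samples of n, say $v_b$, is given by

$$v_b = \frac{1}{n} \left[ \frac{1}{F} \sum_{j} \sum_{l} (v_{jl} + m_{jl}^2) - \left( \frac{1}{F} \sum_{j} \sum_{l} m_{jl} \right)^2 \right] \qquad (23.22)$$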

The reader will notice that the sampling variance can be exhibited
in the form of an analysis of variance. If $\bar{v}$ is the mean of variances
within families and $v_m$ is the variance of means (between families) we have

$$n v_b = \bar{v} + v_m \qquad (23.23)$$

From (23.21) $v_a$ can be put in a similar form but the mean of $v_{jl}$ is weighted
according to the number of members in a family and the sum corresponding
to the $m_{jl}$ is similarly weighted.

A comparison of (23.21) and (23.22) will show that if the means and
variances increase with size of family, or if the variances increase and
the means remain constant, $v_a$ is greater than $v_b$, for the larger families
then contribute relatively more to $v_a$. The situation might then arise
in which we had a smaller sampling variance by choosing one member
from each family in the sample. On the other hand we have to be careful
not to obtain a biased estimate. In this case, the mean of a sample of n,
one from each family, might be biased. For the mean of such a sample
(over all possible samples) is the same as the mean of one member over
all possible samples consisting of one family, that is to say, is the
unweighted mean

$$\frac{1}{F} \sum_{j=1}^{k} \sum_{l=1}^{F_j} m_{jl}$$

This may differ from the population mean given by (23.20). We must
always be careful, therefore, in looking for estimates with minimum
variance, not to choose one which may be seriously biased.
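The contrast between the two schemes can be seen on a very small invented population. The Python sketch below takes three hypothetical families of one, two and three members, in which both the means and the variances increase with size of family. It computes the mean and variance of a single observation under each scheme; the variance of a sample mean follows on division by n when n is small compared with the population.

```python
# Hypothetical values of x for families of sizes 1, 2 and 3 (invented data).
families = [[1.0], [2.0, 4.0], [3.0, 6.0, 9.0]]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Variance with divisor equal to the number of values."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Scheme (a): an individual drawn at random from the whole population.
individuals = [x for family in families for x in family]
population_mean = mean(individuals)
population_variance = variance(individuals)

# Scheme (b): a family drawn at random, then one member of that family.
family_means = [mean(family) for family in families]
family_variances = [variance(family) for family in families]
one_per_family_mean = mean(family_means)  # the unweighted mean of family means
one_per_family_variance = mean(family_variances) + variance(family_means)

print(f"(a) mean {population_mean:.4f}, variance {population_variance:.4f}")
print(f"(b) mean {one_per_family_mean:.4f}, variance {one_per_family_variance:.4f}")
```

With these figures scheme (b) has the smaller variance, but its mean falls short of the population mean because the larger families, which have the larger values, are under-represented.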

23.14 At this point we may mention briefly certain other types of 
sampling which are sometimes used. In some of these cases the methods 
have not yet been put on a satisfactory theoretical basis and the reader 
who proposes to use them should read more widely before doing so.* 

(a) Systematic sampling. Where the members of a population are 
arranged in some spatial or temporal order (e.g. persons listed alpha- 
betically in a telephone directory, price quotations given regularly each 
week, plants growing in rows in a field) it is sometimes convenient to 
choose a sample by selecting members at equal intervals along the order. 
For instance we may select every 100th name on a list, or every fifth plant 
in a row. We referred in 16.26 to the selection of houses in a street and 
the dangers of occasional bias which it might introduce. Such methods 
have been called (not very aptly) systematic sampling. Where the 
population is patchy they have the appearance of avoiding selecting by 
chance too many members in an unrepresentative area. On the other 
hand, where there are rhythms present in the population (as, for example, 
in oscillatory time series or in soil which has been cultivated by machine) 
the method may give very unreliable results. It can only be recommended 
when there is good reason to think on prior grounds that the interval 
between members of the sample has no relation to any possible systematic 
properties of the population. 
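The danger from rhythms in the population can be shown with an artificial case. In the Python sketch below the population is a hypothetical sequence with a period of ten, and a systematic sample is drawn first at an interval equal to the period and then at an interval of seven. The function `systematic_sample` is an illustrative helper, not a standard routine.

```python
import random


def systematic_sample(population, interval, start):
    """Every `interval`-th member, beginning at position `start`."""
    return population[start::interval]


def mean(values):
    return sum(values) / len(values)


# Hypothetical population with a rhythm of period 10: 0, 1, ..., 9 repeated.
population = [i % 10 for i in range(1000)]
true_mean = mean(population)

rng = random.Random(1)  # fixed seed so that the sketch is reproducible
start = rng.randrange(10)

# Interval equal to the period: every member selected has the same value.
in_step = systematic_sample(population, 10, start)

# Interval of 7, which has no common factor with the period.
out_of_step = systematic_sample(population, 7, start % 7)

print(f"true mean            {true_mean:.2f}")
print(f"interval 10, mean    {mean(in_step):.2f}")
print(f"interval 7, mean     {mean(out_of_step):.2f}")
```

When the interval coincides with the period the sample mean merely reproduces the value at the starting position, whatever the size of the sample.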

(b) Quota sampling. In social surveys involving interviews when the
work has, in general, to be divided among a number of investigators it 
has sometimes been the practice to assign to each a definite sample number 
which he must attain — he may, for instance, be instructed to secure 
200 schedules, and to go on until he has obtained that number. This 
method would be unobjectionable if the sample were random, but un- 
fortunately circumstances may arise which vitiate the randomness. The 
investigator who meets with refusal to complete a schedule or otherwise 
fails to obtain one from a previously selected individual (e.g. because of 
his absence), must go on until the quota is full, and may be forced to take 
his sample where he can get it, not where he would like to get it. Checks 
and controls throughout are most desirable in this type of sampling. 


* See F. Yates, Sampling Methods for Censuses and Surveys, 1949, Griffin and Co., for
an extended account of the subject and a bibliography. 
(c) Sequential sampling. This method (which has been put on a
satisfactory theoretical basis, although many problems remain unsolved) 
aims at economising in the size of sample required to reach a prescribed 
degree of probability in making a correct decision. 

In the ordinary sampling process such as we have described it in fore- 
going chapters, we select a sample of pre-determined size and calculate 
from it the required estimate together with its standard error (or, for 
small samples, an equivalent quantity) which sets limits to the values 
between which the parameter value may be stated to lie to a prescribed 
degree of probability. In sequential sampling we invert the process to 
some extent. We decide, on the basis of the prescribed degree of 
probability, what are the limits within which we can accept the sample 
estimate as consistent with prescribed parameter values and then sample 
one by one. If at any stage the sample estimate (or more generally, 
some suitable statistic calculable from the sample) falls outside the limits 
appropriate to the size of sample which has been reached up to that point, 
we reject the hypothesis that the population parameter has the prescribed 
value or set of values under consideration. An excellent account of the 
method will be found in A. Wald's Sequential Analysis.
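As an illustration only, the following Python sketch applies a sequential test of the kind associated with Wald to a proportion. The text above does not give the limits; the two boundaries used here are the usual approximate ones, log((1 − β)/α) and log(β/(1 − α)), and the function name, the hypotheses p0 and p1 and the error rates are assumptions made for the example.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of p = p0 against p = p1 (p1 > p0).

    Observations are examined one by one; sampling stops as soon as the
    log likelihood ratio leaves the interval (lower, upper).
    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)  # crossing this favours p1
    lower = math.log(beta / (1 - alpha))  # crossing this favours p0
    log_ratio = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "accept p1", n
        if log_ratio <= lower:
            return "accept p0", n
    return "continue sampling", n


# A run of successes decides quickly in favour of the larger proportion;
# a run of failures takes longer to decide in favour of the smaller one.
print(sprt_bernoulli([1] * 10, p0=0.1, p1=0.3))
print(sprt_bernoulli([0] * 20, p0=0.1, p1=0.3))
```

The economy of the method lies in the fact that the sample number is not fixed in advance but is determined by the observations themselves.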

Example 23.4. — As an example of an inquiry which was spoilt by 
violating some of the principles we have proposed, we may take the 
Lanarkshire nutritional experiment which was undertaken in 1930 at a 
cost of £7,500. For four months 5,000 children received three quarters
of a pint of raw milk per day, 5,000 received the same quantity of 
pasteurised milk and another 10,000 were chosen as controls. The 
height and weight of the whole 20,000 were measured at the beginning 
and end of the experiment. 

The main object of the experiment, of course, was to see if the milk-fed 
groups gained more in height and weight than the controls, but for it to 
have any value as a basis of generalisation the samples had to be random. 
The intentions of the planners of the experiment were good. Teachers 
selected the children either by ballot or by some alphabetical system. 
But at this point a serious flaw occurred. "In any particular school where
there was any group to which these methods had given an undue proportion
of well-fed or ill-nourished children, others were substituted in order to
gain a more level selection."

It is unfair to be too critical of what was evidently a well-intentioned
procedure to improve the representative quality of the data; but in 
fact this attempt to balance the samples nearly ruined the experiment. 
It was found at the end of the inquiry that the controls were both heavier 
and taller than the fed children by about three months' growth in weight
and four months' growth in height. It appears that the substitutive
process in what looked like unusual samples resulted in the choice of 
better nourished children as controls and worse nourished children as 
feeders. Comparability with controls was thereby invalidated. 
A second object of the inquiry was to see whether there was any
differential effect between raw and pasteurised milk. Here again a 
mistake was made. A particular school obtained either one kind of milk 
or the other, not both. Now in a district which is racially or socially
heterogeneous, it is possible that the selection of one half of the schools 
for one treatment might result in the choice of a set with higher or lower 
standards than the other half, both in the original measurements and 
in the rate of growth. It would have been better to select a number for
feeding with raw and an equal number for feeding with pasteurised milk 
in each school. 
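The allocation suggested here, with both kinds of milk represented in every school and the children assigned purely by chance, might be sketched in Python as follows. The school lists, the group sizes and the function `allocate_within_school` are hypothetical; no substitution is permitted after the shuffle.

```python
import random


def allocate_within_school(children, n_per_milk_group, rng):
    """Randomly assign one school's children to raw milk, pasteurised milk or control."""
    shuffled = list(children)
    rng.shuffle(shuffled)
    k = n_per_milk_group
    return {
        "raw": shuffled[:k],
        "pasteurised": shuffled[k:2 * k],
        "control": shuffled[2 * k:],
    }


# Hypothetical schools (invented names and sizes).
schools = {
    "School 1": [f"child {i}" for i in range(40)],
    "School 2": [f"child {i}" for i in range(60)],
}

rng = random.Random(1930)  # fixed seed so that the sketch is reproducible
allocation = {
    name: allocate_within_school(children, len(children) // 4, rng)
    for name, children in schools.items()
}

for name, groups in allocation.items():
    print(name, {group: len(members) for group, members in groups.items()})
```

Because every school contributes equally to the two milk groups, differences between schools cannot masquerade as a difference between raw and pasteurised milk.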

There were other faults in the design of the experiment and the majority 
of the conclusions which were drawn from it did not, strictly speaking,
follow from the data. The student may consult "Student," Biometrika,
1931, 23, 398 for some further criticisms. 

Examination of samples

23.15 The liability to error of the result of examination of a sample 
unit obviously depends to a high degree on the nature of the observation 
to be made. A simple physical measurement permits of a high degree
of accuracy with little chance of bias, but even here care must 
be exercised, e.g. in taking body-measurements on the human subject 
to determine correctly the points between which the measurement is 
to be taken, and to use a constant degree of pressure in adjusting the 
instrument. If an estimate is made, the possibility, indeed the probability,
of error is at once greatly increased, as we have seen already in the estima- 
tion of shoot-height (16.21). The chances of error are widened yet further 
still if the unit is a human being and makes his own contribution towards 
misleading the observer, by giving untrue or ambiguous answers to his 
questions. In such interviewing work a knowledge of and familiarity 
with psychology may be of far more service to the investigator than a 
knowledge of statistical method. We will give some examples first of 
estimation and secondly of interviewing that will serve to illustrate the 
risks. 

Example 23.5. Corrections for pessimism

Table 23.1 shows the forecasts of yields in potatoes made on various
dates as compared with final estimates, for a series of years. 

These forecasts and estimates are averages based on figures supplied by 
a number of estimators scattered over England and Wales. They are not 
checked against actual yields, although some estimators use known results
in their areas for particular farms and fields in arriving at their judgment.
The striking thing about the figures is the uniform sign of the difference 
between the forecasts and the final estimate. 

TABLE 23.1. Forecasts of yields of potatoes in England and Wales in tons per acre
(From the official agricultural statistics)
(% diff. = percentage difference from final estimate)

            Sept. 1st             Oct. 1st              Nov. 1st           Final
Year     Yield   % diff.      Yield   % diff.      Yield   % diff.      estimate

1929      5.7     -17.4        6.2     -10.1        6.5      -5.8          6.9
1930      6.0      -7.7        6.1      -6.2        6.1      -6.2          6.5
1931      5.5       0.0        5.3      -3.6        5.3      -3.6          5.5
1932      6.4      -3.0        6.2      -6.1        6.3      -4.5          6.6
1933      6.4      -4.5        6.2      -7.5        6.4      -4.5          6.7
1934      6.0     -15.5        6.3     -11.3        6.7      -5.6          7.1
1935      5.6      -9.7        5.7      -5.1        6.0      -3.2          6.2
1936      6.0      -3.2        5.9      -4.8        5.8      -6.5          6.2

This type of bias is quite different from the one noticed in Example
23.1. There the investigators measured the yield of definite areas and the
bias apparently lay in their enthusiasm in extending those areas a little
too widely. Here the investigators are not measuring but judging and
the bias arises from excessive caution, a kind of chronic pessimism which 
is well recognised in agricultural circles. The remedy would be either 
to lay down a series of harvesting experiments on properly chosen sites, 
or to "correct" forecasts in future by scaling them up proportionately
to the average deficiency over a previous series of years. Given time, 
of course, it might also be possible to educate the observers out of their 
pessimism, but this would not be without its dangers and might for a 
time swing the balance the wrong way. 
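The second remedy can be tried on the figures of Table 23.1. The Python sketch below is one possible reading of "scaling up proportionately": it averages the ratio of final estimate to September forecast over 1929 to 1935 and applies that factor to the 1936 forecast. The choice of years and of an unweighted average of ratios is an assumption made for illustration.

```python
# Sept. 1st forecasts and final estimates (tons per acre), from Table 23.1.
years = [1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936]
sept_forecast = [5.7, 6.0, 5.5, 6.4, 6.4, 6.0, 5.6, 6.0]
final_estimate = [6.9, 6.5, 5.5, 6.6, 6.7, 7.1, 6.2, 6.2]

# Estimate the correction from the earlier years only (1929 to 1935).
history = list(zip(sept_forecast[:-1], final_estimate[:-1]))
scale = sum(final / forecast for forecast, final in history) / len(history)

# Apply it to the forecast for the last year and compare with the outcome.
corrected_1936 = scale * sept_forecast[-1]

print(f"average ratio of final estimate to forecast: {scale:.4f}")
print(f"1936 forecast {sept_forecast[-1]:.1f}, corrected {corrected_1936:.2f}, "
      f"final estimate {final_estimate[-1]:.1f}")
```

The deficiency varies a good deal from year to year, so a correction of this kind may overshoot in a year when the forecasters happen to be nearer the mark.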

Example 23.6. — When an investigator is sent out into the field to collect 
results he may, if he is lazy or dishonest, shirk his duties and send in 
returns which are spurious. Once these faked records have occurred it 
is difficult to detect them unless the inquiry has been specially designed 
to be self-checking in this respect, but various methods are available to 
check the general accuracy of the individual or to restrain his tendency 
to make entries by guesswork. One useful device is to have a second 
investigator cover some of the same ground. This results in a certain 
amount of duplication of effort but is often worth the extra trouble and 
expense. The two investigators need only have part of their field in 
common. The knowledge that any particular return is likely to be 
checked by another investigator is often a sufficient spur to accurate 
recording in all the records. 

Table 23.2 shows a comparison of two recordings by surveyors A and 
B made on identical fields within a fortnight of each other. The surveyors
merely had to record the crop under which each of 332 fields lay and no
question of measurement or estimation was involved apart from the
identification of the plants.


TABLE 23.2. Comparison of duplicated complete enumeration in a district of Bengal

(Mahalanobis, J. Roy. Stat. Soc., 1946, 109, 325.)

                                                  B-Survey
A-Survey                         Jute   Winter   Winter rice   No crop   Totals
                                         rice     and jute

Jute                               4      15          4           3        26
Monsoon rice                       4      12          1           4        21
Monsoon rice and jute             17      66          2           9        94
Jute, monsoon and winter rice      -       2          -           -         2
Rice (monsoon and winter)          1       -          -           -         1
No crop                           37      45          4         102       188

Totals                            63     140         11         118       332


The discrepancies are obviously very large and it is impossible to avoid
the conclusion that one of the surveyors at least was not carrying out
his duties properly. Errors on this scale can hardly be due to accident
or inability. There is a strong presumption that one of the surveyors
at least was either not exercising reasonable care or definitely falsifying
his records.

23.16 Unintentional errors on the part of investigators can to some extent 
be eliminated by training and careful instruction, and the magnitude of 
unconscious bias can often be gauged by letting them undertake a dummy 
inquiry on material for which the results are known. Where resources 
permit, however, it is very valuable to replicate the inquiry among 
different observers to see how far they differ among themselves. This 
is especially desirable in inquiries which necessarily depend on subjective 
judgment, such as the assessment of a candidate's qualities in a personal
interview, a grading by an inspector of the suitability of a house for 
habitation or the rating of an employee for promotion. 

Example 23.7. In an inquiry into family budgets in Nagpur
(Mahalanobis, J. Roy. Stat. Soc., 1946, 109, 325) information was collected,
inter alia, of total income and of monthly expenditure. The area under
examination was divided into five zones. Within each zone samples were
selected by picking families at random and these were divided into four
sub-samples, each of which was random and independent of the others.
There were four investigators, each taking one sub-sample at random
in each zone. Within each sub-sample about 50 schedules were collected.
Thus the total of about 1,000 schedules (actually 997 because of small
imperfections in carrying out the design) can be classified into a 5 × 4 × 50
grouping, and the variance-analysis is of the following form.

TABLE 23.3. Nagpur Family Budget Inquiry

Analysis of Variance

(For ref. see Table 23.2)

Variation                       d.f.     Quotient     Quotient
                                         (Income)     (Monthly expenditure)

Between zones (Z)                 4       4,439.6        3,707.9
Between investigators (I)         3          85.4          597.1
Interaction (ZI)                 12         382.5          397.3

Between sub-samples              19       1,189.7        1,127.1
Within sub-samples              977         401.6          384.7

Total                           996         424.7          398.9


We have shown only the degrees of freedom and the quotients in the 
table. If the reader multiplies the two to obtain the sum of squares 
he will find that the sums "between sub-samples" and "within
sub-samples" do not add to the total sum. This, of course, is due to the
fact that the numbers in sub-classes are not units but are about 50. 

The analysis shows the interaction between zones and investigators. 
If there were only one schedule in the sub-sample there would only be 
19 degrees of freedom altogether ; but as there are about 50 schedules in 
the sub-samples we can form an estimate of the variance within sub- 
samples by taking the variance of each set of schedules in a sub-sample 
and pooling for the 20 sub-samples. It is this "residual" variance
(401.6 for income and 384.7 for monthly expenditure) which is to be
compared with the other variances to test departure from homogeneity. 

Taking income first, we find that the ratios of the residual quotient 
to the quotients between investigators and the interaction are not 
significant. This is encouraging and indicates that the investigators are 
accurate (or at least consistent). The quotients between zones and 
between sub-samples are significant at a 1 per cent level. This was to 
be expected from the nature of the inquiry, for the zones were deliberately 
chosen from differentiated areas. 

A similar conclusion is reached in respect of monthly expenditure. 
The reader can verify the arithmetic of the significance for himself.
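As an aid to that verification, the Python sketch below forms the ratio of each quotient in Table 23.3 to the residual quotient "within sub-samples". It stops at the ratios; the significance points of the variance-ratio distribution for the stated degrees of freedom must still be taken from tables. Dividing each quotient by the residual is the usual convention and is assumed here.

```python
# Degrees of freedom and quotients (income, monthly expenditure) from Table 23.3.
quotients = {
    "Between zones (Z)": (4, 4439.6, 3707.9),
    "Between investigators (I)": (3, 85.4, 597.1),
    "Interaction (ZI)": (12, 382.5, 397.3),
    "Between sub-samples": (19, 1189.7, 1127.1),
}
residual_df = 977
residual_income = 401.6
residual_expenditure = 384.7

# Ratio of each quotient to the residual quotient, for the two variates.
ratios = {
    source: (df, income / residual_income, expenditure / residual_expenditure)
    for source, (df, income, expenditure) in quotients.items()
}

print(f"{'Variation':<28}{'d.f.':>6}{'Income':>10}{'Expenditure':>14}")
for source, (df, r_income, r_expenditure) in ratios.items():
    print(f"{source:<28}{df:>6}{r_income:>10.2f}{r_expenditure:>14.2f}")
print(f"(each ratio has {residual_df} d.f. in the denominator)")
```

The ratios for investigators and for the interaction lie close to or below unity for income, whereas those for zones are many times larger, which is the pattern described in the text.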

TABLE 23.4. Bengal crop survey
Comparison of two independent estimates of proportion p under winter rice by two parties of investigators
[Body of table not recoverable. Surviving marginal totals: 1,510 353 272 226 280 288 277 288 355 404 605 1,346; grand total 6,204]

23.17 To avoid confusion we refer at this point to a technical meaning
of the word "bias" which has recently come into use in advanced
theoretical statistics. A statistic t which is used as an estimator of a
parameter θ is said to be biased if the mean value of t over all possible
samples is not equal to θ. Thus, as we saw in 21.4, the sample variance $s^2$
is a biased estimator of the parent variance $\mu_2$ because the average value
of $s^2$ over all samples is (n − 1)/n times $\mu_2$ instead of $\mu_2$ itself. To obtain an
unbiased estimator we must use the statistic

$$s'^2 = \sum (x - \bar{x})^2 / (n - 1)$$

The meaning attached to the word "bias" in this chapter is not
restricted to departure from the criterion we have just mentioned. In
the narrower sense of that criterion "bias" is a quality of the estimator
employed and may exist when the sampling is random. In the more
general sense bias may be used to connote any effect which distorts the
representativeness of the result, whether in the estimating process or in
the selection and examination of the sample.
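The narrower sense of the word can be verified exactly on a small case. The Python sketch below takes a hypothetical parent population of four equally likely values, enumerates every possible sample of two drawn with replacement, and averages the two forms of the sample variance over all of them.

```python
from itertools import product

population = [1.0, 2.0, 3.0, 4.0]  # hypothetical parent population
n = 2                              # sample size


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


# Variance of the parent population.
parent_mean = sum(population) / len(population)
parent_variance = sum((x - parent_mean) ** 2 for x in population) / len(population)

# All samples of n drawn with replacement; each is equally likely.
samples = list(product(population, repeat=n))

mean_s2 = sum(sum_of_squares(s) / n for s in samples) / len(samples)
mean_s2_corrected = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(f"parent variance                 {parent_variance:.4f}")
print(f"average of s^2  (divisor n)     {mean_s2:.4f}")
print(f"average of s'^2 (divisor n - 1) {mean_s2_corrected:.4f}")
```

The first average falls short of the parent variance by the factor (n − 1)/n, and the second reproduces it, although the sampling is perfectly random in both cases.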

Cumulative effect of bias 

23.18 There is a popular belief that even if individuals make mistakes 
their errors in the aggregate will tend to cancel out, so that an average 
of a number of instances will be less distorted by bias than any particular 
single instance. To some extent this is true. If the errors are in the
nature of sampling fluctuations we know that the standard error of a mean
decreases in inverse proportion to the square root of the number of
observations. But it would be a mistake to assume that all types of bias
tend to be of the self-cancelling kind. It is not true that if only enough
people make enough mistakes the average of their opinions or estimates
lies near the real value.
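The distinction can be illustrated by a small simulation in Python. All the figures are invented: a true value of 100, random errors with a standard deviation of 5, and, for the second set of observers, a common tendency to understate by 5 per cent.

```python
import random

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
true_value = 100.0
n = 10_000

# Observers whose errors are pure fluctuation about the true value.
unbiased = [true_value + rng.gauss(0.0, 5.0) for _ in range(n)]

# Observers with the same fluctuation who all understate by 5 per cent.
biased = [0.95 * true_value + rng.gauss(0.0, 5.0) for _ in range(n)]

mean_unbiased = sum(unbiased) / n
mean_biased = sum(biased) / n

print(f"mean of unbiased observations: {mean_unbiased:.2f}")
print(f"mean of biased observations:   {mean_biased:.2f}")
```

Increasing n brings the first mean as close to 100 as we please, but brings the second only closer to 95; the shared bias is untouched by averaging.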

23.19 We have had one example of the cumulative effect of bias in 
Example 23.5, in which we saw that, in spite of the number of crop 
estimators concerned, the mean of their forecasts was systematically 
below the final estimate. Evidently they were all affected more or less 
by the same tendency which therefore persists in the average of the 
individual results. How far, in any particular inquiry, we may assume 
that individual biases tend to cancel in the aggregate depends on the 
nature of the inquiry. We clearly cannot assume that there is safety in 
numbers where individuals may be affected by the same kind of bias, 
e.g. if there is any general tendency to over-estimate for reasons of personal
pride, or where some force is at work to remove from the sample individuals
of one particular type. On the other hand, cases are known wherein
biases (not merely chance fluctuations) do appear to cancel themselves
out very largely.

Example 23.8. (Data from Mahalanobis, loc. cit. Example 23.7.)

A certain area of 6,204 "grids" of about 2¼ acres each was surveyed
independently by two parties A and B. Each party recorded for each
grid the estimated proportion under winter rice. The results are shown
in Table 23.4.
If the two parties were in complete agreement only diagonal cells
would contain non-zero entries. The differences are evidently quite
substantial, there being only 51.6 per cent of the cases showing complete
agreement.

Nevertheless the mean of p for A (mean of column totals) is 52.0 per
cent, whereas that for B (row totals) is 51.9 per cent, an extraordinarily
close agreement. Thus, in spite of the differences on individual grids the 
estimates for the whole are satisfactorily concordant. 

Example 23.9. The "vanity" effect.

The preceding examples have related to defects on the part of the 
observers. We now consider a different type in which bias is introduced 
by a distorted response from the members of the samples. 

In an inquiry into listeners’ preferences for radio programmes subjects 
were asked by interview for their opinion on broadcast religious services. 
52 per cent of the persons indicated by their response, in the interviewer's 
judgment, that they were enthusiastic or moderately enthusiastic. One 
might have been tempted to infer that about half the listening public were 
keen listeners to religious broadcasts. In fact the listening audience 
seemed to be about 10 per cent of the listening population, another and 
more direct inquiry into the audition of actual programmes giving 
proportions ranging from 3 per cent to 18 per cent. (See Silvey, J. Roy.
Stat. Soc., 1944, 107, 190 for details.)

Without dwelling on questions of standard error we can see at once 
that the responses in the interviews were strongly biased. There can be 
little doubt that this was due to the wish on the subject's part not to
be classified as indifferent to spiritual influences. The same kind of effect
is apt to arise in any inquiry into cultural tastes, few people being willing
to admit to a stranger that they do not care for good music, however 
rarely they go to the trouble of listening to it. 

Example 23.10. The "sympathy" effect.

The Listener is a British weekly journal devoted to broadcasting matters. 
An inquiry was made to find out how many people read it. Now in this 
case the circulation of the journal is known and, by making due allowances 
for the numbers of people who read the same copy in family units, a fair 
estimate can be obtained of the total number of people who can possibly
read one issue. The percentages obtained from sampling inquiries showed
that four or five times as many people said they read it as could have done
so. (See the remarks by Durant on the paper by Silvey referred to in 
the previous example.) 

It would not be correct to deduce that the majority of the people 
replying affirmatively to the question whether they read the Listener are
deliberate liars. There is a natural tendency on the part of many people
to give to the questioner the reply which they think would please him.
They infer that an affirmative answer would do so (thinking, perhaps,
that the questioner is a representative of the publishers) and stretch their
consciences to the extent of saying that they read the journal when they
may, for instance, only have seen it on a bookstall or in a friend's house,
or even if they have merely seen it advertised. This "sympathy"
response is all the more difficult to guard against because the interviewer
must try to ingratiate himself with his subject in order to obtain a reply
at all.

In this particular case there is another possible explanation of the bias. 
The subject may imagine that if he gives a negative response an attempt 
will be made to sell him the journal. He therefore anticipates any possible 
sales pressure by stating that he takes the journal already. 

23.20 The lessons to be learnt from such experiences as these are 
numerous. We will indicate a few methods which the investigator 
may sometimes be able to use to minimise the risk of the distorted 
response. 

(a) If possible the aim of the inquiry should be concealed from the
subject. This will prevent him from "co-operating" with the interviewer
to get what he may consider the desired result. But it is often im- 
practicable to expect him to answer questions without asking some in 
return ; and very often the purpose of the inquiry is clear merely from the 
fact that it is made. 

(b) The questions should be framed unambiguously so as to elicit a
"yes-no" response or a three-way answer customary in opinion inquiries:
"yes/no/don't-know".

(c) Independent checks on veracity can sometimes be obtained in a 
roundabout way. In Example 23.9 we mentioned a case where a direct 
check was available. An inquiry on a political subject, for example, may 
well embody some question which permits of checking against known 
results for the aggregate, such as "Did you vote at the last election?"
The interpretation of the results of these "control" questions is not
always very easy, but they provide valuable collateral evidence on the 
general representative character of the responses. 

(d) If there is prior reason to suppose that different types of subject
will give varying degrees of distortion in response, results for the types may 
be analysed separately. Suppose we are conducting by personal interview 
an inquiry which involves recording the subject’s age. Knowing that the 
incentive to lie about age varies from one age-group to another, we may 
analyse the replies, if they are sufficiently numerous, into age-groups. 
From known census data or by making certain assumptions about the 
population under examination based on known facts such as birth-rates 
and death-rates, we can estimate what the results ought to be if the 
subjects are telling the truth, and hence gauge the direction and extent 
of the bias. 
SUMMARY

1. The complete sampling process consists of (a) the choice of unit, 
(b) the selection of the sample of units and (c) the examination of the units. 

2. For "continuous" regions there is usually no natural unit; and for
a discontinuous population practical considerations may suggest, as
size of unit, groups of the individuals comprising the population.

3. By the use of appropriate variable sampling fractions in stratified
sampling a considerable reduction may be made in the sampling variance
of estimates of the mean. For linear estimates the optimum estimate is
given when the numbers taken from the strata are proportional to the
standard deviations of the variate under investigation in those strata.

4. Various examples are given of the introduction of bias, due to flaws
in the "examination" of the sample.


EXERCISES 

23.1 Consider possible sources of bias in replies to the following enquiries:

(a) Persons are asked to state how often they attended a place of 
entertainment during the previous year ; 

(b) Persons are asked to state how many days have elapsed since they
last attended a place of entertainment. 

Consider how far the answers to (b) may be used as a check on the answers 
to (a). 


23.2 Ten investigators are to be sent to ten traffic centres in a city to
record the number of automobiles passing a specified point in a specified
time. Two of the investigators are suspected of being unreliable. Design 
a method of carrying out the inquiry which will exhibit this unreliability, 
if it exists, and will also provide unbiased results if the other investigators 
are reliable. 

23.3 A number of businesses are asked to provide figures showing stocks
of specified goods on hand at a specified date, and the returns are required
within a specified and rather short time. Consider what kinds of bias
might appear in the answers. 

23.4 A random sample is drawn from the records of a fire insurance
company with the object of estimating the number of fire "incidents"
occurring in a certain period in dwelling houses. Consider how far this
sample is likely to be unrepresentative of all fire "incidents" which require
the attention of a public fire service.

23.5 If equation (23.10) may be accepted as self-evident, provide a
simplified proof of the result of equation (23.11). In the manner of 23.10 
derive equation (23.13). 

23.6 A population is stratified into four (large) groups for which the
number of members and the variances are as follows:

    Group     Number     Variance

      1       10,000        16
      2       20,000        25
      3       40,000        36
      4       30,000         4


Find the variance of an estimate of the parent-mean based on a sample 
of 400 from the population 

(a) by taking 100 from each stratum;

(b) by taking a constant proportion, 0.4 per cent, from each stratum;

(c) by choosing the sample numbers (to the nearest unit) proportionately 
to the standard deviations in the strata ; 

(d) by taking the sample numbers as the optimum, as given by (23.15). 

23.7 A population consists of N members in order, divided into k groups
of n. A sample is selected by taking the jth member of each group, so
that it is systematic and consists of the members $x_j$, $x_{j+n}$, $x_{j+2n}$, etc.
Show that the variance of the mean of the sample, say $\bar{x}$, is given by

$$\operatorname{var} \bar{x} = \frac{v}{k}\{1 + (k-1)\rho\}$$

where v is the variance of the population and $\rho$ is the intraclass correlation
coefficient of the n groups of k consisting of jth members (j = 1, . . . n).
Hence show that var $\bar{x}$ is greater than, equal to, or less than the variance
of the mean of a random sample according as the intraclass correlation is
positive, zero or negative. It may be assumed that N is large compared with k.

23.8 A sample is drawn from an ordered population of N (= kn) members
by dividing it into k sets of n and taking a member at random in each of
the k sets. Consider generally whether the mean of such a sample will
have a smaller variance than the mean of an unrestricted random sample.

23.9 One of the main difficulties in house-to-house inquiries is to make
proper allowance for those houses where there is no one at home when the
call is made. It has been suggested that suitable methods of dealing
with this problem would be

(a) to call back persistently until an occupant was found to be at home;

(b) to sub-sample the non-responsive houses by calling back persistently 
at a proportion of them ; 

(c) if possible, to stratify houses beforehand according to the proportion 
of the day during which somebody was at home, and to sample at 
random in each stratum, ignoring the non-responders. 

Examine the relative merits of these methods. 


23.10 Discuss the problems of obtaining estimates of average annual
values in the following cases:

(a) expenditure of persons on holidays by sampling at various dates
in the year;

(b) rainfall at a certain locality by sampling for rainfall on a specified
number of days;

(c) output of a factory product by sampling output on certain dates.



CHAPTER TWENTY-FOUR


INTERPOLATION AND GRADUATION 


Simple interpolation 

24.1 If the value of a function of a single variable x, say $u_x$, has been
tabulated for equidistant values of the variable x, x+h, x+2h, etc., we
often require to find the value of the function corresponding to an
intermediate value of the variable. Functions in very general use, such
as common logarithms, have usually been tabulated with intervals so small
that even over a range of several intervals the relation between $u_x$ and x
may be assumed to be effectively linear, that is of the form

$$u_x = a_0 + a_1 x \qquad (24.1)$$

as is shown by the constancy of the differences between successive values
of u. For example,

TABLE 24.1

    Number     Logarithm     Difference (+)

    30597      4.4856788
                               0.0000142
    30598      4.4856930
                               0.0000142
    30599      4.4857072
                               0.0000142
    30600      4.4857214
                               0.0000142
    30601      4.4857356
                               0.0000142
    30602      4.4857498



If we then require, say, the value of log 30600.3, it is sufficient to use the
familiar process of simple interpolation:

    log 30600             4.4857214
    0.3 x 0.0000142              43
                          ---------
                          4.4857257

The little multiplication sum is, in most tables, already done for us in the
margin.
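The arithmetic of simple interpolation can be sketched in a few lines of Python. This is only an illustration: the function name `linear_interpolate` is hypothetical, and the two logarithms are the tabulated values from Table 24.1.

```python
def linear_interpolate(u0, u1, fraction):
    """Simple interpolation between two adjacent tabulated values.

    `fraction` is the distance past the first argument, measured in
    units of the tabular interval (0 <= fraction <= 1).
    """
    first_difference = u1 - u0
    return u0 + fraction * first_difference


# Seven-figure logarithms of 30600 and 30601, as tabulated in Table 24.1.
log_30600 = 4.4857214
log_30601 = 4.4857356

estimate = linear_interpolate(log_30600, log_30601, 0.3)
print(f"{estimate:.7f}")  # prints 4.4857257
```

The product 0.3 x 0.0000142 is the "little multiplication sum" of the text; rounding to seven places is left to the final formatting step.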
Differences

24.2 For any function which has been tabulated to sufficiently fine
intervals (within certain limitations) simple interpolation can be used in
this way; it is only a question of making the intervals sufficiently small
(see below, 24.16). But many functions have not been tabulated in such
detail, successive differences are not equal, and consequently simple
interpolation cannot give an accurate result. The problem then arises,
how are we to interpolate with reasonable precision? And the answer is
given by proceeding to higher orders of differences, as they are termed; i.e.
instead of considering only the first differences

$$\Delta_0^1 = u_1 - u_0$$
$$\Delta_1^1 = u_2 - u_1$$

etc., we also consider the second differences

$$\Delta_0^2 = \Delta_1^1 - \Delta_0^1$$
$$\Delta_1^2 = \Delta_2^1 - \Delta_1^1$$
$$\Delta_2^2 = \Delta_3^1 - \Delta_2^1$$

etc., or even the third differences, fourth differences, etc.

24.3 To take an actual example. Table 24.2 shows the squares of the 
first few natural numbers, together with their first and second differences. 
Following a practice which is convenient for printing and for most purposes 
of practical work, each difference is printed, not on a line between the 
two figures to which it relates, as with the logarithms in Table 24.1 above, 
but on the same line as the upper figure of the two concerned, the line
of the figure subtracted; and as the signs of the differences are constant
for each column this sign is simply stated at the top.

TABLE 24.2

    Number    Square    First diff.      Second diff.     Third diff.
      x       $u_x$     $\Delta^1$(+)    $\Delta^2$(+)    $\Delta^3$

      0          0          1                2                0
      1          1          3                2                0
      2          4          5                2                0
      3          9          7                2                -
      4         16          9                -                -
      5         25          -                -                -




Here we see that the first differences, the only ones with which we
have been concerned hitherto, are no longer constant; but they follow a
simple rule, in that they are an arithmetic series, a linear function of x.
As a result, the second differences are constant, actually +2, and
consequently the third differences vanish.
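The columns of such a difference table can be formed mechanically. The following Python sketch (the helper name `difference_table` is an assumption for illustration) builds every order of difference for the squares of Table 24.2.

```python
def difference_table(values):
    """Return [values, first differences, second differences, ...]."""
    columns = [list(values)]
    while len(columns[-1]) > 1:
        previous = columns[-1]
        # each difference is the lower figure less the upper figure
        columns.append([b - a for a, b in zip(previous, previous[1:])])
    return columns


squares = [x * x for x in range(6)]
table = difference_table(squares)
for order, column in enumerate(table[:4]):
    print(order, column)
```

Order 0 is the function itself; orders 1, 2 and 3 are the first, second and third differences, which come out as an arithmetic series, a constant 2, and zeros respectively.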
24.4 The figures on the first line of such a table are called the leading
term (0) and the leading differences (+1, +2, 0), and it is evident that,
given the leading term and the leading differences, the whole table could
be built up by successive addition as far as we pleased, without calculating
any square directly except for checking. The series of first differences
would be obtained by adding 2 over and over again, starting from the
leading difference 1, i.e. 1+2=3, 3+2=5, etc. The squares would be
given then by adding these differences in succession to the leading term 0:
0+1=1; 1+3=4; 4+5=9, etc.
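Building a table by successive addition from its first line can be sketched as follows. The function name `build_from_leading` is hypothetical; the inputs are the leading term and leading differences read from the first line of a difference table.

```python
def build_from_leading(leading_term, leading_differences, count):
    """Rebuild `count` tabulated values from the first line of a difference table."""
    # state[0] is the current function value, state[k] the current kth difference
    state = [leading_term] + list(leading_differences)
    values = []
    for _ in range(count):
        values.append(state[0])
        # add to each entry the difference standing on its right; working from
        # left to right ensures the not-yet-updated neighbour is used
        for k in range(len(state) - 1):
            state[k] += state[k + 1]
    return values


print(build_from_leading(0, [1, 2, 0], 6))     # squares
print(build_from_leading(0, [1, 6, 6, 0], 6))  # cubes
```

The highest difference supplied is treated as constant, which is exact for a polynomial of the corresponding degree.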

Differences of a polynomial

24.5 From these results we may conclude quite generally that the second
differences of any polynomial of the second degree,

$$u_x = a_0 + a_1 x + a_2 x^2 \qquad (24.2)$$

are constant and the third differences vanish. For, if we multiply all the
squares in Table 24.2 by any factor $a_2$, we merely multiply all the differences
of every order by the same factor; and the linear part of the function,
$a_0 + a_1 x$, cannot contribute to second differences.

Below we give a similar table, Table 24.3, for the cubes of the first few
natural numbers, and here it will be seen that third differences are constant

TABLE 24.3

    Number    Cube     First diff.      Second diff.     Third diff.      Fourth diff.
      x       $u_x$    $\Delta^1$(+)    $\Delta^2$(+)    $\Delta^3$(+)    $\Delta^4$

      0          0          1                6                6               0
      1          1          7               12                6               0
      2          8         19               18                6               -
      3         27         37               24                -               -
      4         64         61                -                -               -
      5        125          -                -                -               -


and fourth differences vanish. By similar reasoning we may conclude
that the third differences of any polynomial of the third degree,

$$u_x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 \qquad (24.3)$$

are constant and the fourth differences vanish. The student will be quite
correct if he draws the general conclusion that for a polynomial of the rth
degree,

$$u_x = a_0 + a_1 x + a_2 x^2 + \ldots + a_r x^r \qquad (24.4)$$

the rth differences are constant and the (r+1)th differences vanish. To
prove this it is only necessary to note that each successive differencing
lowers the degree of a polynomial by unity, for the difference of any term
$x^p$ is

$$(x+1)^p - x^p = p\,x^{p-1} + \ldots + 1$$

which is a polynomial of degree (p-1).
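As a numerical check on this conclusion, the sketch below differences a cubic repeatedly. The coefficients are arbitrary illustrative choices, not values from the text; with leading coefficient 4 the third differences should all equal 6 x 4 = 24 and the fourth should vanish.

```python
def differences(values):
    """First differences of a list of equally spaced tabulated values."""
    return [b - a for a, b in zip(values, values[1:])]


def cubic(x):
    # Arbitrary illustrative coefficients (an assumption, not from the text).
    return 5 - 3 * x + 2 * x**2 + 4 * x**3


column = [cubic(x) for x in range(8)]
orders = {}
for order in range(1, 5):
    column = differences(column)
    orders[order] = column
    print(order, column)
```

Each pass lowers the degree by one, so the printed rows run from a quadratic sequence, through a linear one, to a constant and finally to zeros.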
Newton’s formula


24.6 Evidently these results hold out some possibility of generalising
our method of interpolation. If, instead of only considering two successive
values of u, say $u_0$ and $u_1$, and using the linear relation between $u_x$ and x
that will reproduce these values to give any required intermediate value
of $u_x$, we can use the polynomial of the second degree which will reproduce
three adjacent values, $u_0$, $u_1$, $u_2$, or that of the third degree which will
reproduce four, $u_0$, $u_1$, $u_2$, $u_3$, evidently we shall be likely to get much
more precise results. But to do this we must be able to obtain the required
polynomials in terms of the differences. We shall use the notation already
introduced in 24.2.



Further, the common interval for the values of x will be taken as unity, 
as shown ; in practical work this is always treated as the unit until the 
end of the work, just as the class-interval is so treated when calculating 
the moments of a frequency-distribution. 

24.7 Now write down the leading term and leading differences at the
head of a table with spacious columns, as below, up to the leading fourth
difference, and fill in the rest of the table working back from right to
left. In column 5 for third differences we can fill in only the second
space, $\Delta_0^3 + \Delta_0^4$. In column 4 for second differences the second term
will be $\Delta_0^2 + \Delta_0^3$ (always adding from the line above to the right); the
third term will be $\Delta_0^2 + 2\Delta_0^3 + \Delta_0^4$. We leave the student to supply the
remainder.
Now look at the numerical coefficients in the expressions for $u_1$, $u_2$, $u_3$,
etc.; they run

    1
    1+1
    1+2+1
    1+3+3+1
    1+4+6+4+1

These are familiar figures; they are the terms in the binomial expansions
of $(1+1)^1$, $(1+1)^2$, $(1+1)^3$, $(1+1)^4$, etc. We then have, generally,

$$u_x = u_0 + x\Delta_0^1 + \frac{x(x-1)}{1 \cdot 2}\Delta_0^2 + \frac{x(x-1)(x-2)}{1 \cdot 2 \cdot 3}\Delta_0^3 + \ldots \qquad (24.5)$$

where the series of differences may be continued so far as is necessary to
give a result of the precision desired. This important equation is known
as Newton’s Rule or Newton’s Formula. It may be repeated that in this
form of the equation the unit of x is the interval. There are many other
formulae of interpolation, but we propose to limit ourselves to this and
illustrate its uses.
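A minimal Python sketch of Newton's formula follows. The function name `newton_forward` and its `order` parameter are assumptions made for illustration; x is measured in tabular intervals from the first value supplied, as the text requires.

```python
def newton_forward(values, x, order=None):
    """Evaluate Newton's formula (24.5) at x, in tabular intervals from values[0].

    `order` is the highest order of difference used; by default every
    difference that the supplied values allow is included.
    """
    # Collect the leading term and leading differences (the first line of the table).
    leading = []
    column = list(values)
    while column:
        leading.append(column[0])
        column = [b - a for a, b in zip(column, column[1:])]

    if order is None:
        order = len(leading) - 1

    result = 0.0
    coefficient = 1.0  # becomes x, x(x-1)/2, x(x-1)(x-2)/6, ...
    for r in range(order + 1):
        result += coefficient * leading[r]
        coefficient *= (x - r) / (r + 1)
    return result


# Squares of 0, 1, 2, 3: second differences are constant, so the result is exact.
print(newton_forward([0, 1, 4, 9], 2.5))           # 6.25
print(newton_forward([0, 1, 4, 9], 2.5, order=1))  # simple interpolation from u0: 2.5
```

Stopping at `order=1` uses only the first leading difference, which shows how much the higher differences contribute when the function is not linear.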


24.8 It will be seen that, if the series on the right of (24.5) is terminated
at $\Delta_0^r$, the expression is a polynomial of the rth degree in x, though it
is not arranged according to powers of x but according to the successive
orders of difference, which is more convenient for our present purpose.
This polynomial passes through the r+1 successive points $(0, u_0)$, $(1, u_1)$,
$(2, u_2)$, . . . $(r, u_r)$. In particular, if the series terminates at $\Delta_0^1$, we
have simple interpolation and the polynomial reduces to the straight line
passing through $(0, u_0)$ and $(1, u_1)$. If it terminates at $\Delta_0^2$, the series
represents a parabola of the second degree passing through the three points
$(0, u_0)$, $(1, u_1)$, $(2, u_2)$. If it terminates at $\Delta_0^3$, it represents a polynomial
of the third degree passing through the four points $(0, u_0)$, $(1, u_1)$, $(2, u_2)$,
$(3, u_3)$; and so on. But the student must remember that even though
the polynomial reproduces the values of the function at 0, 1, 2 and 3, it
does not necessarily closely reproduce the function at intermediate values
of x. The whole utility of the formula is dependent on the closeness with
which the variable can be represented locally by a polynomial of fairly low
degree. Most ordinary functions satisfy this condition when tabulated
for small intervals, but occasionally the student may find himself in
difficulties. We will give some examples in later sections.

We now proceed to some illustrations, and will give a warning at once : 
the student must be very careful as to signs. 

Example 24.1. Given the cubes below, required to find the cube of
32.4.

We give this first as an example in which the interpolation is exact,
for the third differences are constant, so that we need not proceed further.
    Number    Cube      $\Delta^1$(+)    $\Delta^2$(+)    $\Delta^3$(+)

      31      29791        2977             192               6
      32      32768        3169             198               6
      33      35937        3367             204               -
      34      39304        3571              -                -
      35      42875         -                -                -


As interpolation is exact, it does not matter which term we take as
$u_0$. Supposing we take an origin at x = 32. Then for 32.4, x = 0.4, and
we have

$$\begin{aligned}
u_{0.4} &= 32768 + 0.4(3169) - 0.12(198) + 0.064(6) \\
        &= 32768 + 1267.6 - 23.76 + 0.384 \\
        &= 34012.224
\end{aligned}$$

This may be verified by direct multiplication, or from Barlow’s Tables;
the student is recommended to carry out a check by taking an origin at
x = 31.
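The recommended check with the origin at 31 can be sketched as below. The leading term and leading differences are those on the line for 31 in the table above; the variable names are illustrative only.

```python
# Leading term and leading differences on the line for 31.
u0, d1, d2, d3 = 29791, 2977, 192, 6

x = 32.4 - 31  # distance from the new origin, in tabular intervals

estimate = (u0
            + x * d1
            + x * (x - 1) / 2 * d2
            + x * (x - 1) * (x - 2) / 6 * d3)

print(round(estimate, 3), round(32.4 ** 3, 3))
```

Both figures agree with the 34012.224 obtained from the origin at 32, as they must when the third differences are constant.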

Example 24.2. Given the following cube roots, find the cube root of
102.5. The differences have been written, as is frequently done, without
the insertion of the decimal point.

    Number    Cube root     $\Delta^1$(+)    $\Delta^2$(-)    $\Delta^3$(+)

     101      4.6570095       153192            997              14
     102      4.6723287       152195            983
     103      4.6875482       151212
     104      4.7026694





Here, if we wish to attain the greatest possible precision and include the
third difference, we can only take an origin at 101; x is then 1.5, and

$$\begin{aligned}
u_{1.5} &= u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3 \\
        &= 4.6570095 + 0.02297880 - 0.00003739 - 0.00000009 \\
        &= 4.67995082
\end{aligned}$$

Here we have retained an extra place of decimals throughout the
arithmetic in order to get the seventh place correct in the final result, and must
round this off to 4.6799508. Even so, we cannot avoid the effect of errors
in our data, viz. the errors of rounding off, in the seventh place of decimals,
the tabulated cube roots; the seventh place in our answer is still liable
to an error of ±1 to ±2 for this reason.

It may be noted that, as differences converge so rapidly in this example,
simple interpolation would give an error of little more than a unit in the
fifth place of decimals.
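The two levels of approximation in this example can be compared numerically. In the sketch below the tabulated cube roots are taken from the example, while the "true" value is computed directly by Python's floating-point arithmetic, which is an addition for checking purposes and not part of the original working.

```python
# Tabulated seven-figure cube roots of 101, 102, 103, 104.
u = [4.6570095, 4.6723287, 4.6875482, 4.7026694]

# Newton's formula from the origin at 101, x = 1.5, to third differences.
d1 = u[1] - u[0]
d2 = u[2] - 2 * u[1] + u[0]
d3 = u[3] - 3 * u[2] + 3 * u[1] - u[0]
x = 1.5
newton = (u[0] + x * d1 + x * (x - 1) / 2 * d2
          + x * (x - 1) * (x - 2) / 6 * d3)

# Simple interpolation half-way between 102 and 103.
simple = u[1] + 0.5 * (u[2] - u[1])

true_value = 102.5 ** (1 / 3)
print(f"Newton: {newton:.8f}  error {newton - true_value:+.1e}")
print(f"Simple: {simple:.8f}  error {simple - true_value:+.1e}")
```

The simple interpolation misses by roughly one unit in the fifth decimal place, while the third-difference result is limited mainly by the rounding of the tabulated values.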
Example 24.3. From the table of Ordinates of the Normal Curve
(Appendix Table 1) find the value of the ordinate at $x/\sigma = 0.045$.

We give this example partly as a warning to the student to see that
his differences are converging so as to be likely to give a good result.
The second difference is numerically much larger than the first, viz.
392 against 199; he must then look at the third as well; if this be large
also, he may have to go to a high order of differences to get precision.
But the third difference is only +18 and the fourth difference smaller
still, so third differences will suffice for the highest precision attainable
with the five-figure table. Note that the first difference is negative, the
second negative, the third positive, and since the interval is 0.1, x = 0.45,
not 0.045.

In the difference terms we have retained two decimals beyond the
five during the work (separated by a comma):

$$\begin{aligned}
u_{0.45} &= u_0 + 0.45\Delta_0^1 - 0.12375\Delta_0^2 + 0.0639375\Delta_0^3 \\
         &= 0.39894 - 0.00089{,}55 + 0.00048{,}51 + 0.00001{,}15 \\
         &= 0.39854 \text{ rounded off to the fifth place}
\end{aligned}$$

Interpolating in the seven-figure table, Table II in Tables for Statisticians
and Biometricians, this is found correct to the last place. It may be
noted that, if a calculating machine is used, the products given by
successive terms can be cumulated on the machine.

Interpolation of statistical series

24.9 So far we have dealt with straightforward interpolation of tabulated 
mathematical functions. But interpolation may also be employed on 
statistical series, or series of figures founded on statistics, provided at 
least that they run tolerably smoothly. No statistical series or series 
founded on statistics does, however, run absolutely smoothly, like a 
mathematical function, unless of course it has been deliberately
"graduated" to do so. It must be recognised, therefore, in such cases
that we are merely using interpolation as a method of estimating the truth ; 
and the truth in all probability would not and could not be given by any 
process of interpolation. 

The following is an illustration of a series based on statistics. 

Example 24.4. — In Part II of the Supplement to the 75th Report 
of the Registrar-General for England and Wales, abridged life-tables 
were given for a number of counties, etc. The table below shows the 
expectation of life at ages 25, 35, etc. to 85, based on the mortality of 
males in Cambridgeshire in 1910-12, i.e. the average number of years 
that individuals would have lived from the given age onwards, if subjected 
at each age to the mortality mentioned. Required, to interpolate values 
for the expectation of life at ages 30, 40, etc. 
    Age      Expectation     $\Delta^1$    $\Delta^2$    $\Delta^3$
             of life
             (Males)

    25         42.21          -824           +20           +34
    35         33.97          -804           +54           +27
    45         25.93          -750           +81           +76
    55         18.43          -669          +157            -3
    65         11.74          -512          +154
    75          6.62          -358
    85          3.04

    Total                    -3917          +466          +134

    Bottom figures
    less top   -39.17         +466          +134


Tables of mathematical functions will often give the differences, but
in dealing with data of this kind the student will certainly have to form
them himself, and should carry out the check shown. Having formed the
column of first differences, he should take the total, of course paying
attention to signs. In this case the total of first differences is -3917,
or inserting the decimal point, -39.17. This obviously must be equal
to the difference between the bottom figure and the top figure in the
preceding column, as we see is the case. The following columns must
be checked similarly.
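The summation check can be sketched in Python as follows. The data are the expectations of life from the example; working in hundredths (an implementation choice made here to keep the arithmetic in whole numbers) reproduces the figures printed without the decimal point.

```python
expectation = [42.21, 33.97, 25.93, 18.43, 11.74, 6.62, 3.04]

# Work in units of the second decimal place so that all sums are exact integers.
column = [round(100 * value) for value in expectation]

checks = []
for order in (1, 2, 3):
    diffs = [b - a for a, b in zip(column, column[1:])]
    total = sum(diffs)
    bottom_less_top = column[-1] - column[0]  # taken from the preceding column
    checks.append((order, total, bottom_less_top))
    column = diffs

for order, total, bottom_less_top in checks:
    print(order, total, bottom_less_top, total == bottom_less_top)
```

A mismatch in any row would point to a slip in forming that column of differences.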

The second differences are considerably smaller than the first differences. 
Third differences are also small, but rather irregular ; it will be found, 
however, that the contributions of the third differences affect only the second 
place of decimals in the function, so we ought to attain a very fair result. 

To get the figures for ages 30 and 40 we have not much choice and must 
use the known values at ages 25 to 55. On general grounds it seems 
best to keep the value of x for which we require $u_x$ near the centre of the
values used for interpolation. So the expectation at 50 was determined 
from the values at 35 to 65, that at 60 from the values at 45 to 75, and 
that at 70 from the values at 55 to 85. The expectation at 80 was 
determined with the use of the second difference only from the values at 
65, 75, 85. 

The work is quite straightforward and the results were: 30, 38.09;
40, 29.90; 50, 22.10; 60, 14.94; 70, 8.99; 80, 4.64. The student
may find it instructive to draw a chart.
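These interpolated expectations can be reproduced with a short Python sketch. The helper `newton` and the dictionary `plan` are illustrative names; `plan` simply encodes which tabulated ages the text says were used for each interpolated age.

```python
def newton(values, x):
    """Newton's formula on `values`, with x in tabular intervals from values[0]."""
    leading, column = [], list(values)
    while column:
        leading.append(column[0])
        column = [b - a for a, b in zip(column, column[1:])]
    total, coefficient = 0.0, 1.0
    for r, difference in enumerate(leading):
        total += coefficient * difference
        coefficient *= (x - r) / (r + 1)
    return total


ages = [25, 35, 45, 55, 65, 75, 85]
expectation = [42.21, 33.97, 25.93, 18.43, 11.74, 6.62, 3.04]

# age sought -> (index of first tabulated value used, number of values used)
plan = {30: (0, 4), 40: (0, 4), 50: (1, 4), 60: (2, 4), 70: (3, 4), 80: (4, 3)}

results = {}
for age, (start, count) in plan.items():
    x = (age - ages[start]) / 10  # the tabular interval is ten years
    results[age] = round(newton(expectation[start:start + count], x), 2)

print(results)
```

Age 80 uses only three values (65, 75, 85), so its result stops at the second difference, as described in the text.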

But some qualms were felt as to how far the results could be trusted.
A polynomial is not a very good function to represent an empirical function
of the present kind which is slowly dropping to zero (see below, 24.12).
It might possibly be more appropriate to take logarithms of the
expectations, interpolate between the logarithms and then convert back into
numbers. The test was carried out as a control. The following are then
the data and the differences:
    Age      log              $\Delta^1$     $\Delta^2$     $\Delta^3$
             (Expectation)

    25        1.62542         -0.09432       -0.02298       -0.00799
    35        1.53110         -0.11730       -0.03097       -0.01662
    45        1.41380         -0.14827       -0.04759       -0.00536
    55        1.26553         -0.19586       -0.05295       -0.03623
    65        1.06967         -0.24881       -0.08918
    75        0.82086         -0.33799
    85        0.48287

    Total                     -1.14255       -0.24367       -0.06620

    Bottom figures
    less top  -1.14255        -0.24367       -0.06620


The work was done exactly as before, except that the expectation at 
80 was obtained with three differences from the given values at 55 to 85. 
The results differed only very slightly from those obtained before, the 
following table giving a complete comparison — 


             Interpolation
    Age    Direct    Logarithmic    Difference

    25     42.21       42.21           -
    30     38.09       38.07         -0.02
    35     33.97       33.97           -
    40     29.90       29.91         +0.01
    45     25.93       25.93           -
    50     22.10       22.11         +0.01
    55     18.43       18.43           -
    60     14.94       14.92         -0.02
    65     11.74       11.74           -
    70      8.99        9.00         +0.01
    75      6.62        6.62           -
    80      4.64        4.63         -0.01
    85      3.04        3.04           -


The differences are almost immaterial. 
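The logarithmic control can be sketched in the same way. One assumption should be noted: the sketch computes logarithms with `math.log10` rather than reading them from a five-figure table, so the last digit of a result may occasionally differ from the hand working.

```python
import math


def newton(values, x):
    """Newton's formula on `values`, with x in tabular intervals from values[0]."""
    leading, column = [], list(values)
    while column:
        leading.append(column[0])
        column = [b - a for a, b in zip(column, column[1:])]
    total, coefficient = 0.0, 1.0
    for r, difference in enumerate(leading):
        total += coefficient * difference
        coefficient *= (x - r) / (r + 1)
    return total


ages = [25, 35, 45, 55, 65, 75, 85]
expectation = [42.21, 33.97, 25.93, 18.43, 11.74, 6.62, 3.04]
logs = [math.log10(value) for value in expectation]

# As before, except that age 80 now uses three differences from ages 55 to 85.
plan = {30: (0, 4), 40: (0, 4), 50: (1, 4), 60: (2, 4), 70: (3, 4), 80: (3, 4)}

log_results = {}
for age, (start, count) in plan.items():
    x = (age - ages[start]) / 10
    # interpolate on the logarithms, then convert back to numbers
    log_results[age] = 10 ** newton(logs[start:start + count], x)

for age, value in log_results.items():
    print(age, f"{value:.2f}")
```

Comparing this output with the direct results shows discrepancies only in the second decimal place, in line with the comparison table.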

Notes on the practical work

24.10 Number of differences to use. Provided differences converge fairly
rapidly and continuously, there is little difficulty in coming to a decision.
The student knows to how many digits he desires to be accurate, and it
is no use his going on to higher orders of difference which affect only
places beyond this; if he wants four-figure accuracy, it is no good his
going on to differences which affect only the sixth and seventh places.
To enable him to see more quickly the approximate contribution that
a difference of any order will give, the following table of the binomial
coefficients may be useful:
TABLE 24.4. Table of the binomial coefficients in Newton’s formula from
x = 0 to x = 2 by intervals of 0.1

     x      x(x-1)/(1.2)    x(x-1)(x-2)/(1.2.3)    x(x-1)(x-2)(x-3)/(1.2.3.4)

    0          0                 0                      0
    0.1       -0.045            +0.0285                -0.0206625
    0.2       -0.08             +0.048                 -0.0336
    0.3       -0.105            +0.0595                -0.0401625
    0.4       -0.12             +0.064                 -0.0416
    0.5       -0.125            +0.0625                -0.0390625
    0.6       -0.12             +0.056                 -0.0336
    0.7       -0.105            +0.0455                -0.0261625
    0.8       -0.08             +0.032                 -0.0176
    0.9       -0.045            +0.0165                -0.0086625
    1.0        0                 0                      0
    1.1       +0.055            -0.0165                +0.0078375
    1.2       +0.12             -0.032                 +0.0144
    1.3       +0.195            -0.0455                +0.0193375
    1.4       +0.28             -0.056                 +0.0224
    1.5       +0.375            -0.0625                +0.0234375
    1.6       +0.48             -0.064                 +0.0224
    1.7       +0.595            -0.0595                +0.0193375
    1.8       +0.72             -0.048                 +0.0144
    1.9       +0.855            -0.0285                +0.0078375
    2.0       +1                 0                      0


A word of warning may, however, be desirable. Because the use of the
(r+1)th difference would not affect the result in the nth figure, it does
not necessarily follow that this polynomial value will agree with the true
value of the function to the nth figure.
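The entries of Table 24.4 follow directly from the coefficients in Newton's formula and can be regenerated as a check. In this sketch the function name `newton_coefficients` and its `highest_order` parameter are hypothetical.

```python
def newton_coefficients(x, highest_order=4):
    """Coefficients of the 2nd, 3rd, ... differences in Newton's formula at x."""
    coefficients = {}
    value = x  # coefficient of the first difference
    for r in range(2, highest_order + 1):
        # x(x-1)...(x-r+1) / r!  obtained from the previous coefficient
        value *= (x - (r - 1)) / r
        coefficients[r] = value
    return coefficients


for tenth in range(0, 21):
    x = tenth / 10
    c = newton_coefficients(x)
    print(f"{x:3.1f}  {c[2]:+.3f}  {c[3]:+.4f}  {c[4]:+.7f}")
```

Multiplying a coefficient by the size of the corresponding difference gives a quick estimate of what that order would contribute to the result.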

If differences do not converge rapidly and continuously, this is in
itself evidence that a polynomial of moderately high order does not fit
the function well and high precision cannot be expected. The student
may occasionally find himself faced by cases more difficult than those of
the foregoing illustrations. For example, here are the initial values of
P for values of $\chi^2$ proceeding by unity, and degrees of freedom $\nu = 6$
($n' = 7$), from Table XII in Tables for Statisticians, etc., Part I:

    $\chi^2$       P           $\chi^2$       P

       0        1.000000          5        0.543813
       1        0.985612          6        0.423190
       2        0.919699          7        0.320847
       3        0.808847          8        0.238103
       4        0.676676          9        0.173578
If we wish to find by interpolation the value at, say, 0.5, apparently we
have no choice but to take our $u_0$ at zero, for the table starts there. If
the student begins work accordingly, he will find his differences not
behaving at all nicely; the second leading difference is much greater than
the first; the third is a good deal less, but the fourth, fifth and sixth
much larger than the third, and it is not until the seventh and higher
differences that definite convergence seems to be setting in. If he
laboriously works step by step, getting successive approximations to the
value of P at 0.5 by using one difference, two differences and so on, he
will get a series of very slowly converging values:

    1. 0.992806
    2. 0.999247
    3. 0.999658
    4. 0.998993
    5. 0.998445
    6. 0.998131
    7. 0.997973
    8. 0.997899
    9. 0.997865

The true value is 0.997839, and he could have obtained this much quicker
by direct calculation; even with the nine differences he has got only
four-figure accuracy. But he ought not to have expected a good result if he
had taken the trouble to look at the run of the differences. The figures
give another useful warning. Using three differences, we have a worse
result than when using two only. Increasing the number of differences by
one step does not necessarily increase precision.
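The slow convergence can be reproduced from the ten tabulated values of P. The sketch below accumulates the terms of Newton's formula one at a time; the variable names are illustrative, and the true value 0.997839 is the figure quoted in the text.

```python
p_values = [1.000000, 0.985612, 0.919699, 0.808847, 0.676676,
            0.543813, 0.423190, 0.320847, 0.238103, 0.173578]

# Leading term and leading differences (first line of the difference table).
leading = []
column = p_values[:]
while column:
    leading.append(column[0])
    column = [b - a for a, b in zip(column, column[1:])]

x = 0.5
true_value = 0.997839  # as quoted in the text

approximations = []
total = leading[0]
coefficient = 1.0
for r in range(1, len(leading)):
    coefficient *= (x - (r - 1)) / r   # x, x(x-1)/2, x(x-1)(x-2)/6, ...
    total += coefficient * leading[r]
    approximations.append(total)

for r, value in enumerate(approximations, start=1):
    print(r, f"{value:.6f}", f"error {value - true_value:+.6f}")
```

The printed errors make the warning visible: the third approximation is further from the truth than the second.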

Limitation of the number of differences suitable for use, owing to the 
effect on differences of errors of rounding off, is considered below (24.14 
and 24.15). 

24.11 Choice of the set of u’s. To interpolate, say, at x = 2.5, using
third differences, one might employ either the u’s at 0, 1, 2, 3, or those
at 1, 2, 3, 4, or those at 2, 3, 4, 5; one would not go outside these limits or
one would have to extrapolate for the value at 2.5, and that would obviously
be unsafe. Which set is it best to choose? Advice cannot be absolutely
definite, but it would seem that usually (but not necessarily) values about
equidistant from that sought should be equally valuable as guides, and on
this principle we should try and keep the value sought so far as possible
central to the set of u’s employed.

This suggests that one reason for our getting so poor a result above was
that we used such a lop-sided set of u’s, with the value sought apparently
unavoidably near one end. Let us avoid this by a device. Repeat the
value of P for +1 at -1 on the other side of zero. (It is true that this has
no physical meaning, but the function might conceivably run symmetrically
on either side of zero, and its graph has clearly high-order contact with
a horizontal tangent at zero.) Now take the four values at -1, 0, +1, +2
and interpolate, using the resulting three differences only:


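    $\chi^2$       P           $\Delta^1$      $\Delta^2$      $\Delta^3$

      -1        0.985612      +0.014388       -0.028776       -0.022749
       0        1             -0.014388       -0.051525
      +1        0.985612      -0.065913
      +2        0.919699

Interpolating for the value of $u_{1.5}$ we have:

$$\begin{aligned}
u_{1.5} &= u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3 \\
        &= 0.997825
\end{aligned}$$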


The true value, as stated above, is 0.997839, and we have got a closer
result by this rearrangement, using third differences only, than we did by
using nine differences before.
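The reflection device amounts to a four-point interpolation with the origin moved to -1. A short sketch, with illustrative variable names and the tabulated values of P from the text:

```python
# P at chi-squared = -1 (the reflected value), 0, +1, +2.
values = [0.985612, 1.0, 0.985612, 0.919699]

first = [b - a for a, b in zip(values, values[1:])]
second = [b - a for a, b in zip(first, first[1:])]
third = [b - a for a, b in zip(second, second[1:])]

x = 1.5  # chi-squared = 0.5 lies 1.5 intervals beyond the origin at -1
estimate = (values[0]
            + x * first[0]
            + x * (x - 1) / 2 * second[0]
            + x * (x - 1) * (x - 2) / 6 * third[0])

print(f"{estimate:.6f}")  # prints 0.997825
```

The value sought now sits at the centre of the four points used, which is the arrangement recommended in 24.11.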


24.12 Possible forms of polynomials. The student may also get into
difficulties if he does not bear in mind the forms that polynomials can,
and cannot, take; and if he attempts to use this method of interpolation
where the polynomial is unlikely to represent the function well even over
a moderate range. A polynomial (parabola) of the second order can take
only the form (a) in fig. 24.1. A polynomial of the third order can take the
form (b), or the form (c) with a wave in the centre. A polynomial of the
fourth order can take a form very much resembling (b), but flatter in the
centre, or a form like (c), but with three instead of two half-waves in the
middle; and so on. A polynomial cannot take the form (1) of a curve
tangential or asymptotic to the vertical, like the end near zero of an ideal
frequency-curve of the distribution-of-wealth type, or (2) of a curve
slowly dropping asymptotically to the horizontal, like a logarithmic curve
or the tail of the normal curve; and such functions, mathematical or
empirical, are very frequent in statistics. In this latter case it would be
more probable that the function could be represented by a function of the
form

$$y = e^{a_0 + a_1 x + a_2 x^2 + \ldots}$$

Then taking logs we have

$$u = \log_e y = a_0 + a_1 x + a_2 x^2 + \ldots$$

that is to say, we come back to the polynomial. Hence, if the function
we are dealing with is tailing slowly away to zero, it is probably best to
take logarithms and then interpolate on the logarithms. That is why in
Example 24.4 we carried out a check in that way. There, as it happened,
the direct method did not lead to bad results, but it is quite possible for it to
give a completely nonsensical answer. For example, at the extreme end
of the $\chi^2$ table for $\nu = 28$ ($n' = 29$), we are given only the values of P
corresponding to the following values of $\chi^2$:

    $\chi^2$       P           $\Delta^1$      $\Delta^2$      $\Delta^3$

      40        0.066128      -0.059661       +0.053601       -0.047929
      50        0.006467      -0.006060       +0.005672
      60        0.000407      -0.000388
      70        0.000019

Taking differences as shown and interpolating to get an estimate of the
value of P for $\chi^2 = 55$, i.e. $u_{1.5}$, we have:

$$\begin{aligned}
u_{1.5} &= u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3 \\
        &= -0.000268
\end{aligned}$$


But this is nonsense, for P cannot be negative. The polynomial has done
its best; it reproduces the values at 40, 50, 60 and 70, but it can only do
this by taking a form like (c) of fig. 24.1 (reversed) with a wave in
the centre. It has, as a matter of fact, a minimum at $\chi^2 = 56.6$ and a
maximum at $\chi^2 = 65.8$, or at 1.66 and 2.58 on the scale of x’s with 40
as zero and 10 as the unit interval.

If, instead, we take logarithms of the above values of P, interpolate
to third differences and then convert back to numbers, as in
Example 24.4, we find 0.001699 for the required value of P, a
value which is rational and is probably not far from the truth.

For $\chi^2 = 30$, P = 0.363218. Even bringing in this much larger value
and using logarithmic interpolation with four differences, we find 0.001746
for the value of P at $\chi^2 = 55$. This suggests that at least we may trust
the value to two figures as 0.0017, which would be sufficient for practice;
but the value has not been checked by direct calculation.
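The contrast between direct and logarithmic interpolation in this tail can be sketched as follows. The helper `third_order` is a hypothetical name, and `math.log10` stands in for a table of logarithms, so the last digit may differ slightly from hand working.

```python
import math


def third_order(values, x):
    """Newton's formula to third differences on four equally spaced values."""
    u0, u1, u2, u3 = values
    d1 = u1 - u0
    d2 = u2 - 2 * u1 + u0
    d3 = u3 - 3 * u2 + 3 * u1 - u0
    return (u0 + x * d1 + x * (x - 1) / 2 * d2
            + x * (x - 1) * (x - 2) / 6 * d3)


# P for chi-squared = 40, 50, 60, 70 (28 degrees of freedom), as tabulated.
p = [0.066128, 0.006467, 0.000407, 0.000019]
x = (55 - 40) / 10  # chi-squared = 55 in units of the tabular interval

direct = third_order(p, x)
log_estimate = 10 ** third_order([math.log10(value) for value in p], x)

print(f"direct:      {direct:.6f}")
print(f"logarithmic: {log_estimate:.6f}")
```

The direct result is negative, which is impossible for a probability, while the logarithmic result is positive and of a plausible size.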

Effect of errors in u on the differences

24.13 The student may notice and be troubled by the fact that, in
the Normal Curve Tables in the Appendix, second differences appear to
get a little irregular towards the tail of the curve; the phenomenon will
become much more evident if he continues the second differences rather
farther than they have been entered, and still more so in the higher
differences if he proceeds to write them out. The irregularities in question are
due solely to the errors of rounding off in the last decimal place of the
function. Before proceeding to consider the total effect of such a system
of errors it may be best to consider the effect of a single error.

24.14 Effect of an error in a single value of u. If $u = v + w$, then
$\Delta^1 u = \Delta^1 v + \Delta^1 w$, and so on for all orders of differences.
Hence, if v represents the true value of u and w represents an error, the
differences of the error will simply be superposed on the differences of v,
and we may consider the former by themselves. We may then, as below, take
the true values of u as zero, and insert an error only at one point, say +e.




   u      Δ¹      Δ²      Δ³      Δ⁴      Δ⁵      Δ⁶

   0      0       0       0       0       0      +e
   0      0       0       0       0      +e      -6e
   0      0       0       0      +e      -5e     +15e
   0      0       0      +e      -4e     +10e    -20e
   0      0      +e      -3e     +6e     -10e    +15e
   0     +e      -2e     +3e     -4e     +5e     -6e
  +e     -e      +e      -e      +e      -e      +e
   0      0       0       0       0       0       0

0 


The resulting differences are written down above, up to those of the sixth
order, and it is evident that the numerical coefficients of e in the differences
of order r are given by the terms of $(1-1)^r$. The effect of the initial
error is therefore very rapidly increased as we proceed to higher and higher
orders of difference, especially after the first three differences are past. An
error of +e in u can produce an error of +3e or -3e in the third differences,
of 6e in the fourth differences, of 10e in the fifth and of 20e in the sixth.
The maximum numerical coefficient for order r is derived from that for
order r-1 by multiplying the latter by 2 if r is even, or by 2r/(r+1) if
r is odd.
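The scheme above can be checked mechanically. The following is a minimal sketch in Python (the helper name `difference_table` is ours, not the text's): it differences a column of zeros containing a single unit error and confirms both the binomial pattern and the rule for the largest coefficient.

```python
from fractions import Fraction
from math import comb


def difference_table(values, max_order):
    """Return [values, first differences, second differences, ...]."""
    table = [list(values)]
    for _ in range(max_order):
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table


# True values taken as zero, with a single error e = 1 in the middle.
u = [0] * 6 + [1] + [0] * 6
table = difference_table(u, 6)

# The non-zero entries of order r are the signed binomial coefficients,
# i.e. the terms of (1 - 1)^r.
for r in range(1, 7):
    nonzero = [d for d in table[r] if d != 0]
    assert nonzero == [(-1) ** i * comb(r, i) for i in range(r + 1)]

# Largest coefficient for each order, built from the rule in the text:
# multiply by 2 if r is even, by 2r/(r+1) if r is odd.
max_coeff = [Fraction(1)]  # order 1
for r in range(2, 7):
    factor = Fraction(2) if r % 2 == 0 else Fraction(2 * r, r + 1)
    max_coeff.append(max_coeff[-1] * factor)

print([int(c) for c in max_coeff])  # [1, 2, 3, 6, 10, 20]
```

Exact fractions are used for the recurrence so that the factor 2r/(r+1) introduces no rounding of its own.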

This magnification of the error renders differencing a very useful
method of checking the calculated table of a function, and it is often
employed for that purpose. The matter is not quite simple, for the effects
of errors of rounding off in the last decimal place will be superposed on the
effects of any actual mistake, but nevertheless the effects of the mistake
are likely to show themselves clearly in, say, third or fourth differences.
In the following table of square roots, for example, nothing is obviously
wrong, but an error of 2 units in the last place has been introduced into the
square root of 15, which should read 3.87298 (or more precisely, 3.8729833).
When we proceed to take differences, however, a suspicious irregularity
shows itself in the third differences, and in the fourth differences it is clear
that something is wrong. Since the position of the "peak" rises half a
line at each differencing, the peak +2 shows that the mistake is in the
root of 15.

  Number   Square root    Δ¹ (+)    Δ² (-)   Δ³ (+)    Δ⁴

    10       3.16228      0.15434     686      83      -14
    11       3.31662      0.14748     603      69      -12
    12       3.46410      0.14145     534      57      -14
    13       3.60555      0.13611     477      43      + 2
    14       3.74166      0.13134     434      45      -14
    15       3.87300      0.12700     389      31        0
    16       4.00000      0.12311     358      31      - 6
    17       4.12311      0.11953     327      25
    18       4.24264      0.11626     302
    19       4.35890      0.11324
    20       4.47214

We can even estimate the magnitude of the error. If the fifth
differences may be taken as approximately constant, we ought to get a fair
estimate of the true fourth difference at the peak +2 by averaging
that difference and the two on each side of it, the total effect of the error
e thus averaging out (compare the scheme showing the effect of the single
error given above). This average is -7.6. We then have

$$6e = +2 - (-7.6)$$

$$e = +1.6$$

This is very near the correct value, which, as will be seen from the true
value of the root stated, is 300 - 298.33 or 1.67, the unit in the Δ⁴ column
being the last place of decimals of the function.
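The check just described is easily automated. The sketch below is illustrative only: it works in units of the fifth decimal place, and it assumes (as happens here) that the mistake is positive, so that the peak is simply the largest fourth difference, and that the peak has two neighbours on each side.

```python
def diff(seq):
    """Forward differences of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


numbers = list(range(10, 21))
# Tabulated square roots in units of the fifth decimal place;
# the entry for 15 carries the deliberate mistake.
roots = [316228, 331662, 346410, 360555, 374166, 387300,
         400000, 412311, 424264, 435890, 447214]

d4 = roots
for _ in range(4):
    d4 = diff(d4)

# With differences written on the line of their first term, the peak in the
# fourth differences stands two lines above the faulty entry.
peak = max(range(len(d4)), key=lambda i: d4[i])
suspect = numbers[peak + 2]

# The error enters five consecutive fourth differences with coefficients
# 1, -4, 6, -4, 1, which sum to zero, so their mean is free of the error.
window = d4[peak - 2: peak + 3]
true_fourth = sum(window) / len(window)
e = (d4[peak] - true_fourth) / 6

print(d4)                    # [-14, -12, -14, 2, -14, 0, -6]
print(suspect, true_fourth)  # 15 -7.6
print(round(e, 2))           # 1.6
```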


24.15 Effect of a series of random errors in u. Suppose these errors
to be a, b, c, d, e, as below. Writing down their differences, we have the
following results:

  Error    Δ¹       Δ²            Δ³                 Δ⁴

    a      b - a    c - 2b + a    d - 3c + 3b - a    e - 4d + 6c - 4b + a
    b      c - b    d - 2c + b    e - 3d + 3c - b
    c      d - c    e - 2d + c
    d      e - d
    e


The general result is obvious. In differences of the rth order, the resultant
error in any one difference is the sum of r+1 of the original errors multiplied
in succession by the terms in the binomial expansion of $(1-1)^r$, or is
of the form

$$e_1 - r e_2 + \frac{r(r-1)}{1 \cdot 2}\, e_3 - \frac{r(r-1)(r-2)}{1 \cdot 2 \cdot 3}\, e_4 + \dots \qquad (24.6)$$

If the errors e are distributed in a purely random way, so that $e_s$ is
uncorrelated with $e_{s+k}$, and if it may be assumed that the mean error is zero,
then the mean error in the difference of the rth order will also in a long
series tend to zero, and the standard deviation, $s_r$, of the above quantity
(24.6) is given by

$$s_r^2 = F(r)\, s_e^2 \qquad (24.7)$$

where $s_e$ is the s.d. of the original errors e, and F(r) is the sum of the squares
of the terms in the binomial expansion of $(1-1)^r$. This may be shown to
be equal to

$$F(r) = \frac{(2r)!}{(r!)^2}$$

F(r) increases very rapidly with r. The following table gives the value
of F(r) and of its square root from r = 1 to r = 6:

    r      F(r)     √F(r)

    1        2       1.41
    2        6       2.45
    3       20       4.47
    4       70       8.37
    5      252      15.87
    6      924      30.40

The standard deviation of errors in the fourth differences is therefore over
eight times, and in the sixth differences over thirty times, the s.d. of the
errors affecting u.

If the decimal place in u be regarded as following the last figure
retained, the errors of rounding off that figure may be regarded as uniformly
distributed over a range ±0.5, and their standard deviation, $s_e$, is therefore
$\sqrt{1/12}$ or 0.288675. This gives the following figures for the s.d. of errors
in the successive orders of difference owing to the errors of rounding off:

  Order of difference     1       2       3       4       5       6

  S.d. of errors         0.41    0.71    1.29    2.42    4.58    8.77
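Both small tables can be reproduced in a few lines. This is a sketch only; the function name `F` follows the text, while the variable names are ours. It forms F(r) directly as a sum of squared binomial coefficients, compares it with the closed form (2r)!/(r!)², and scales by the s.d. of a rounding error.

```python
from math import comb, sqrt


def F(r):
    """Sum of the squares of the terms in the expansion of (1 - 1)^r."""
    return sum(comb(r, k) ** 2 for k in range(r + 1))


s_e = sqrt(1 / 12)  # s.d. of a rounding error uniform over +/- 0.5

F_values = [F(r) for r in range(1, 7)]
root_F = [round(sqrt(f), 2) for f in F_values]
sd_of_differences = [round(s_e * sqrt(f), 2) for f in F_values]

for r, f in zip(range(1, 7), F_values):
    assert f == comb(2 * r, r)  # the closed form (2r)! / (r!)^2

print(F_values)            # [2, 6, 20, 70, 252, 924]
print(root_F)              # [1.41, 2.45, 4.47, 8.37, 15.87, 30.4]
print(sd_of_differences)   # [0.41, 0.71, 1.29, 2.42, 4.58, 8.77]
```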


The effect of the errors of rounding off evidently increases very rapidly
with the order of difference. With a mathematical function for which
the true differences rapidly and continuously converge, the effect of the
errors will in fact soon, so to speak, "take charge"; the observed
differences will rapidly and steadily diverge, growing larger with each successive
differencing. At the same time two other phenomena will show
themselves. Looking back at the scheme showing the effect of the errors
a, b, c, d, e, it will be seen that in any one column the same error enters
into successive differences with sign reversed. Also in any one line
the same error enters into successive differences with sign reversed.
Hence, as the effect of errors of rounding off becomes overwhelmingly
great, (1) the differences of the same order tend to alternate in sign, (2)
differences of successive orders on the same line tend to alternate in sign.
If these phenomena start to show themselves, the student may well
suspect he has gone too far in his differencing. It is evidently no use
proceeding to an order of differences mainly significant of errors.

These results for the effect on differences of a random series of errors
have an application, not only to the effect of errors of rounding off in
mathematical tables, but also to the theory of the variate-difference
method (26.31).
Effect on differences of subdividing an interval

24.16 We mentioned early in this chapter (24.2) that, in general, it
would become possible to use simple interpolation alone on a table of
a mathematical function provided intervals were made sufficiently fine,
but this was not proved. Let us consider the effect on the differences
of subdividing an interval; it will suffice to take the case of halving it,
and for brevity let us confine ourselves to the first three differences.

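In terms of Newton's formula the values of u at 0, 0.5, 1, 1.5 are

$$\begin{aligned}
u_0 &= u_0 \\
u_{0.5} &= u_0 + 0.5\,\Delta_0^1 - 0.125\,\Delta_0^2 + 0.0625\,\Delta_0^3 \\
u_1 &= u_0 + \Delta_0^1 \\
u_{1.5} &= u_0 + 1.5\,\Delta_0^1 + 0.375\,\Delta_0^2 - 0.0625\,\Delta_0^3
\end{aligned} \qquad (24.8)$$

If the student will write down these expressions at the left of a sheet
of foolscap placed lengthwise, and take the differences in the ordinary
way, he will find that the new leading differences for the subdivided
series with intervals of half the original interval are given by

$$\begin{aligned}
\delta_0^1 &= 0.5\,\Delta_0^1 - 0.125\,\Delta_0^2 + 0.0625\,\Delta_0^3 \\
\delta_0^2 &= 0.25\,\Delta_0^2 - 0.125\,\Delta_0^3 \\
\delta_0^3 &= 0.125\,\Delta_0^3
\end{aligned} \qquad (24.9)$$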

If the Δ's of the original series converge rapidly, an assumption really
implied by the fact that we stopped at the third difference, so that we
can regard the successive Δ's as of different orders of magnitude, it will
be seen that $\delta_0^1$ is of the order of magnitude $0.5\,\Delta_0^1$, $\delta_0^2$ is of the order of
magnitude $0.25\,\Delta_0^2$, and $\delta_0^3$ of the order of magnitude $0.125\,\Delta_0^3$. That
is to say, the new differences are not only smaller than the original
differences, but converge much more rapidly.

If we had divided the original interval into ten instead of only two
parts, we could have found the new leading differences in precisely the
same way, and would then have obtained the result that $\delta_0^1$ was of the
order of magnitude $0.1\,\Delta_0^1$, $\delta_0^2$ of the order of magnitude $0.01\,\Delta_0^2$, and
so on, the general rule being obvious. Hence it is only necessary to
subdivide the interval sufficiently in order to render the differences so
rapidly convergent that first differences alone can be used.

In works on the method of differences, tables will usually be found
giving for various values of the number of subdivisions the formulae
relating the δ's to the Δ's.
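The halving relations can be verified by carrying out the differencing symbolically. In the sketch below (helper names are ours) each value of u is held as its four exact coefficients on $u_0$, $\Delta_0^1$, $\Delta_0^2$, $\Delta_0^3$, and the rows are then differenced in the ordinary way.

```python
from fractions import Fraction


def newton_coeffs(x, order=3):
    """Coefficients of (u0, D1, D2, D3) in Newton's formula for u_x."""
    coeffs = [Fraction(1)]
    for k in range(order):
        coeffs.append(coeffs[-1] * (x - k) / (k + 1))
    return coeffs


def diff_rows(rows):
    """Difference successive rows of coefficients, term by term."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(rows, rows[1:])]


half = Fraction(1, 2)
u = [newton_coeffs(k * half) for k in range(4)]  # u at 0, 0.5, 1, 1.5

d1 = diff_rows(u)
d2 = diff_rows(d1)
d3 = diff_rows(d2)

# Leading differences of the halved series, as coefficients on
# (u0, D1, D2, D3) of the original series.
for name, row in (("d1", d1[0]), ("d2", d2[0]), ("d3", d3[0])):
    print(name, [str(c) for c in row])
# d1 ['0', '1/2', '-1/8', '1/16']
# d2 ['0', '0', '1/4', '-1/8']
# d3 ['0', '0', '0', '1/8']
```

Replacing `half` by `Fraction(1, 10)` and taking eleven rows gives the corresponding relations for division into ten parts.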

We now turn to some statistical problems. 

Breaking up a group

24.17 Suppose we are given the numbers living, or the numbers of
deaths, in successive ten-year age-groups; we may often desire to estimate
the numbers in smaller, e.g. five-year, age-groups, or even at single years
of age. The initial difficulty and the method of procedure will best be
shown by an illustration.


Example 24.5

The following are the numbers of deaths in four successive ten-year
age-groups. Required to estimate the numbers of deaths at 45-50 and
50-55.

  Age-group     Deaths

    25-         13,229
    35-         18,139
    45-         24,225
    55-         31,496

Now evidently interpolating directly between these figures will not help 
us. If we interpolated directly between the figure for 35- and the figure 
for 45- (half-way between), we would only have an estimate of the numbers 
in the ten-year age-group 40-50. We must proceed as follows. Add 
up the given numbers step by step ; this will give us a new set of figures 
showing the numbers over 25 but less than 35, over 25 but less than 45, 
over 25 but less than 55, and over 25 but less than 65. Interpolate in 
this new series to find the number over 25 but less than 50, and the differ- 
ences from the numbers next above and below will give the answer 
desired. The work is as follows — 


      1               2                3           4          5
  Exact age     Sum of deaths          Δ¹          Δ²         Δ³
                from 25 to age
                stated

     25                0            +13,229     +4,910     +1,176
     35           13,229            +18,139     +6,086     +1,185
     45           31,368            +24,225     +7,271
     55           55,593            +31,496
     65           87,089





Column 2 gives the numbers from age 25 up to each age stated; column
3 the first differences, reproducing the numbers in the age-groups;
columns 4 and 5 the second and third differences. Since the two third
differences are very nearly equal, working to third differences ought to
give us a very fair result. We can accordingly take age 35 as our zero,
and age 50 will be 1.5 on the scale with the interval as unit. We have
accordingly,

$$\begin{aligned}
u_{1.5} &= u_0 + 1.5\,\Delta_0^1 + 0.375\,\Delta_0^2 - 0.0625\,\Delta_0^3 \\
&= 13{,}229 + 1.5(18{,}139) + 0.375(6{,}086) - 0.0625(1{,}185) \\
&= 42{,}645.7
\end{aligned}$$

or 42,646 to the nearest unit. Subtracting 31,368 from 42,646, and
42,646 from 55,593, we then have for our estimates of the numbers of
deaths:

  45-50     11,278
  50-55     12,947

As a matter of fact, the numbers in quinquennial groups were given, and
for 45-50, 50-55, were actually 11,404 and 12,821; the error of our
estimates accordingly is only of the order of 1 per cent.

Example 24.6. From the same data, estimate the number of deaths
in the year of age 50-51.

The limits of this group on our scale of intervals are, with 35 as origin,
1.5 and 1.6. We have already found the number up to 1.5 in Example
24.5, and it remains only to determine the number up to 1.6, the difference
between the two figures then giving the answer sought:

$$\begin{aligned}
u_{1.6} &= u_0 + 1.6\,\Delta_0^1 + 0.48\,\Delta_0^2 - 0.064\,\Delta_0^3 \\
&= 13{,}229 + 1.6(18{,}139) + 0.48(6{,}086) - 0.064(1{,}185) \\
&= 45{,}096.8
\end{aligned}$$

or 45,097 to the nearest unit. Hence the answer is 45,097 - 42,646, or
2451.
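The arithmetic of Examples 24.5 and 24.6 may be sketched as follows. The function `newton_forward` is our own small helper, not a library routine; it applies Newton's formula through all the values supplied, taking the first as origin and the tabular interval as unit.

```python
from itertools import accumulate


def newton_forward(values, x):
    """Newton's forward formula through `values` (unit spacing), evaluated at x."""
    result, coeff = 0.0, 1.0
    diffs = list(values)
    for k in range(len(values)):
        result += coeff * diffs[0]           # coeff is x(x-1)...(x-k+1)/k!
        coeff *= (x - k) / (k + 1)
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return result


deaths = [13229, 18139, 24225, 31496]            # groups 25-, 35-, 45-, 55-
cumulative = [0] + list(accumulate(deaths))      # sums to exact ages 25, 35, ..., 65
from_35 = cumulative[1:]                         # origin at age 35, as in the text

up_to_50 = round(newton_forward(from_35, 1.5))   # 42646
up_to_51 = round(newton_forward(from_35, 1.6))   # 45097

deaths_45_50 = up_to_50 - cumulative[2]          # 11278
deaths_50_55 = cumulative[3] - up_to_50          # 12947
deaths_50_51 = up_to_51 - up_to_50               # 2451

print(deaths_45_50, deaths_50_55, deaths_50_51)
```

The two five-year estimates necessarily add up to the observed ten-year total, since both are taken as differences from the same interpolated point.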

Simple formula for halving a group

24.18 The problem of estimating the numbers in the two five-year
groups of which a ten-year group is composed occurs so often, that it is
worth while deriving a simple second-difference formula for the purpose.
Let u's denote numbers in five-year groups, w's numbers in ten-year
groups; and let δ's and Δ's denote the corresponding differences. For
second differences we need only consider three consecutive ten-year groups.
From Newton's formula we have:

$$\begin{aligned}
u_0 &= u_0 & & \\
u_1 &= u_0 + \delta_0^1 & w_0 &= 2u_0 + \delta_0^1 \\
u_2 &= u_0 + 2\delta_0^1 + \delta_0^2 & & \\
u_3 &= u_0 + 3\delta_0^1 + 3\delta_0^2 & w_1 &= 2u_0 + 5\delta_0^1 + 4\delta_0^2 \\
u_4 &= u_0 + 4\delta_0^1 + 6\delta_0^2 & & \\
u_5 &= u_0 + 5\delta_0^1 + 10\delta_0^2 & w_2 &= 2u_0 + 9\delta_0^1 + 16\delta_0^2
\end{aligned}$$
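Now write down these values of the w's and difference:

$$\begin{array}{clll}
x & w_x & \Delta^1 & \Delta^2 \\
0 & 2u_0 + \delta_0^1 & 4\delta_0^1 + 4\delta_0^2 & 8\delta_0^2 \\
1 & 2u_0 + 5\delta_0^1 + 4\delta_0^2 & 4\delta_0^1 + 12\delta_0^2 & \\
2 & 2u_0 + 9\delta_0^1 + 16\delta_0^2 & &
\end{array}$$

Whence

$$\Delta_0^1 = 4(\delta_0^1 + \delta_0^2), \qquad \Delta_0^2 = 8\delta_0^2$$

or

$$\delta_0^2 = \tfrac{1}{8}\Delta_0^2, \qquad \delta_0^1 = \tfrac{1}{4}\Delta_0^1 - \tfrac{1}{8}\Delta_0^2$$

Hence,

$$\begin{aligned}
u_2 &= u_0 + 2\delta_0^1 + \delta_0^2 = u_0 + \tfrac{1}{2}\Delta_0^1 - \tfrac{1}{8}\Delta_0^2 \\
u_3 &= u_0 + 3\delta_0^1 + 3\delta_0^2 = u_0 + \tfrac{3}{4}\Delta_0^1 \\
u_2 - u_3 &= -\tfrac{1}{4}\Delta_0^1 - \tfrac{1}{8}\Delta_0^2 = -\tfrac{1}{8}\left(2\Delta_0^1 + \Delta_0^2\right)
\end{aligned}$$

It will be convenient for practical work to express this directly in terms
of the w's:

$$\begin{aligned}
2\Delta_0^1 &= 2w_1 - 2w_0 \\
\Delta_0^2 &= w_2 - 2w_1 + w_0 \\
2\Delta_0^1 + \Delta_0^2 &= w_2 - w_0
\end{aligned}$$

Whence finally, since $u_2 + u_3 = w_1$,

$$u_2 = \tfrac{1}{2}\left\{ w_1 + \tfrac{1}{8}(w_0 - w_2) \right\} \qquad (24.10)$$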

Thus, taking the figures and problem of Example 24.5 again, we have:

$$\begin{aligned}
w_0 &= 18{,}139 \\
w_1 &= 24{,}225 \\
w_2 &= 31{,}496 \\
\tfrac{1}{8}(w_0 - w_2) &= -1{,}669.6 \\
w_1 + \tfrac{1}{8}(w_0 - w_2) &= 22{,}555.4
\end{aligned}$$

and half this gives

$$u_2 = 11{,}278$$

to the nearest unit, as before. For $u_3$, of course, we have also, as before,
24,225 - 11,278 = 12,947. Equation (24.10) is really equivalent to the
method of Example 24.5, though in that illustration we used three
differences. But the third differences of the numbers "aged over 25 but
under x" are equivalent to the second differences of the numbers in the
successive age-groups.
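Equation (24.10) reduces to a one-line function. The sketch below (the name `halve_group` is ours) returns both halves of the central ten-year group, the second half being obtained by subtraction so that the pair always adds up to the observed total.

```python
def halve_group(w0, w1, w2):
    """Split the middle of three consecutive ten-year totals into two
    five-year totals, by the second-difference formula (24.10)."""
    first_half = 0.5 * (w1 + (w0 - w2) / 8)
    second_half = w1 - first_half
    return first_half, second_half


u2, u3 = halve_group(18139, 24225, 31496)
print(round(u2), round(u3))  # 11278 12947
```

When the three totals are in arithmetic progression the formula gives the linear split one would expect; for example `halve_group(100, 200, 300)` returns `(87.5, 112.5)`.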

Graduation

24.19 If a graph is drawn showing the numbers of either sex living
at each single year of age, as given in any census which provides data in
such detail, it will be found anything but smooth, showing the oddest
peaks and hollows which repeat themselves, once adult life is reached, at
ages showing the same final digits. Thus, in the Census of England and
Wales there are conspicuous peaks at the round-numbered ages 30, 40, 50,
etc. (last birthday), and hollows or deficiencies at the ages ending with 1
and, less emphatically, at the ages ending with 7. With returns from less
educated populations, the phenomenon may become almost ludicrous, e.g.
in a certain Indian census sample-count:


  Age last birthday     Number of males

         29                    927
         30                 12,294
         31                    652
         32                  2,058
         33                    672
         34                    892
         35                  7,723
         36                  1,437
         37                    870
         38                  1,362
         39                    467
         40                 10,391
         41                    460


Now whatever irregularities might occur in the true figures, we may be
quite certain that they should not show errors that are simply a function
of the final digit of the age. We would prefer, therefore, to eliminate these
errors. We could do so, somewhat roughly, by drawing a graph as
suggested and sweeping a clean curve through the rather scattered and
irregular points given by the data, subsequently reading off smoothed or
graduated figures from the curve. The graphic process has many points to
recommend it, but is very dependent on personal skill and judgment. It
would be convenient to use a more "mechanical" process that anyone
could apply and be sure of obtaining the same results if he used the same
process. It would be quite possible to fit polynomials to the data by the
methods of Chapter 15, but this would in general entail a great deal of
labour and would not necessarily lead to satisfactory results, e.g. with such
highly erratic data as those above. More suitable processes can be
founded on the method of differences, and the general idea of them all is
quite simple, though the details may vary greatly and the practical working
of some of them become rather complex. All methods begin by assuming
that the totals of certain age-groups (five-year or ten-year age-groups as
a rule) are reasonably accurate. These totals can then be redistributed
over single years of age by the elementary process of Examples 24.5 and
24.6, or the procedure can be in some way elaborated. We shall illustrate
only the simple process.

Example 24.7. The English Census of 1911 gives the following numbers
of males in the three age-groups stated. Obtain graduated numbers at
single years of age for the decade 40 to 49.

  Age-group       Number

    30-         2,637,304
    40-         2,001,178
    50-         1,376,236

As before, we form the sum of these numbers step by step from the 
top and then take differences. 


  Exact     Sum of numbers      Δ¹ (+)       Δ² (-)      Δ³ (+)
   age         from 30

    30               0         2,637,304     636,126     11,184
    40       2,637,304         2,001,178     624,942
    50       4,638,482         1,376,236
    60       6,014,718





We now, taking 30 as our zero, require to interpolate at 1.1, 1.2, 1.3, etc.
to 1.9. The coefficients of the several differences in the successive
applications of Newton's formula are:

     Δ¹         Δ²          Δ³

    +1.1      +0.055      -0.0165
    +1.2      +0.12       -0.032
    +1.3      +0.195      -0.0455
    +1.4      +0.28       -0.056
    +1.5      +0.375      -0.0625
    +1.6      +0.48       -0.064
    +1.7      +0.595      -0.0595
    +1.8      +0.72       -0.048
    +1.9      +0.855      -0.0285


The results, with the known numbers to age 40 and to age 50 added, 
are as given in the second column below, and in the fourth column they 
are differenced to obtain the graduated numbers at each year of age, the 
total of which must agree with the observed total in the ten-year group. 
     1              2                   3             4
   Exact     Sum of population       Age last     Graduated
    age      from 30 to age          birthday      number
             stated

    40          2,637,304               40         228,559
    41          2,865,863               41         222,209
    42          3,088,072               42         215,870
    43          3,303,942               43         209,542
    44          3,513,484               44         203,226
    45          3,716,710               45         196,920
    46          3,913,630               46         190,626
    47          4,104,256               47         184,344
    48          4,288,600               48         178,071
    49          4,466,671               49         171,811
    50          4,638,482

   Total                                         2,001,178
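The whole of this graduation can be reproduced in a few lines. The sketch below again uses a small helper of our own, `newton_forward`, and works in exact fractions so that the rounding to whole persons is the only rounding that occurs.

```python
from fractions import Fraction
from itertools import accumulate


def newton_forward(values, x):
    """Newton's forward formula through `values` (unit spacing), evaluated at x."""
    result, coeff = 0, 1
    diffs = list(values)
    for k in range(len(values)):
        result += coeff * diffs[0]
        coeff = coeff * (x - k) / (k + 1)
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return result


groups = [2637304, 2001178, 1376236]           # males aged 30-, 40-, 50-
cumulative = [0] + list(accumulate(groups))    # sums to exact ages 30, 40, 50, 60

# Interpolate the running sum at exact ages 40, 41, ..., 50, i.e. at
# 1.0, 1.1, ..., 2.0 on the scale with age 30 as zero and ten years as unit.
points = [round(newton_forward(cumulative, 1 + Fraction(i, 10))) for i in range(11)]

# Differencing the running sum gives the graduated number at each age.
graduated = [b - a for a, b in zip(points, points[1:])]

for age, n in zip(range(40, 50), graduated):
    print(age, n)
print("Total", sum(graduated))  # Total 2001178
```

Because the end points at ages 40 and 50 are tabular values, which Newton's formula reproduces exactly, the graduated numbers must sum to the observed ten-year total.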


Below, these figures are compared with the actual returns at the single 
years of age and with two other graduations : (1) A graduation given in 
the Census report and prepared by Mr. George King, F.I.A., based on 
certain quinquennial age-groups. (2) A graduation using analogous 
methods, but based on ten-year age-groups, made at a later date in the 
Government Actuary’s Department, and reproduced by permission. The 
methods are described in rather more detail below. 


      1             2             3               4               5
  Age last       Census       Graduation       King's        Graduation
  birthday       numbers        above        graduation         K₂
                                                 K₁

     40          262,690       228,559        231,070         231,397
     41          198,344       222,209        223,721         225,456
     42          226,889       215,870        216,556         219,233
     43          196,204       209,542        209,314         212,785
     44          190,949       203,226        202,143         206,169
     45          202,458       196,920        195,193         199,442
     46          184,881       190,626        188,610         192,661
     47          176,713       184,344        182,577         185,883
     48          189,271       178,071        176,994         179,165
     49          172,779       171,811        171,589         172,564

   Total       2,001,178     2,001,178      1,997,767       2,024,755


If we compare the closeness of fit of the several graduations to the
Census returns by adding up the differences, observed number less
graduated number, without regard to their sign, and expressing this total as a
percentage of the population (2,001,178), it will be found that our
graduation gives a percentage deviation of 6.28, King's graduation ($K_1$) a
percentage deviation of 6.09, and the graduation $K_2$ a percentage deviation of
6.40, figures which do not differ very largely. It will be noticed,
however, that both the K graduations give, over the range considered, a small
biased error, the total population over the ten years being too small for
$K_1$ and too large for $K_2$. As regards the deviations of the several
graduations from one another, the percentage deviation of our graduation from
$K_1$ is 0.64 and from $K_2$ 1.18, reckoned in each case on the true total
population, and the percentage deviation of $K_2$ from $K_1$ is 1.35, reckoned on the
$K_1$ total. At some individual ages the differences run up to nearly 2 per
cent. This is a warning to the student that while it is true that the use
of any one of these methods by different workers must, unlike the use of the
graphic method, lead to the same result, yet the choice of different methods
may lead to results almost, if not quite, as divergent as those obtained by
different users of the graphic process. Graduated numbers of hundreds of
thousands carried to the last unit suggest a degree of precision much
higher than exists.
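The percentage deviations quoted can be recomputed from the comparison table. In the sketch below the function name `pct_deviation` and the list names are ours; the figures are those of the table of Census numbers and the three graduations at ages 40 to 49.

```python
census = [262690, 198344, 226889, 196204, 190949,
          202458, 184881, 176713, 189271, 172779]
ours = [228559, 222209, 215870, 209542, 203226,
        196920, 190626, 184344, 178071, 171811]
k1 = [231070, 223721, 216556, 209314, 202143,
      195193, 188610, 182577, 176994, 171589]
k2 = [231397, 225456, 219233, 212785, 206169,
      199442, 192661, 185883, 179165, 172564]


def pct_deviation(a, b, base):
    """Sum of absolute differences between a and b, as a percentage of base."""
    return 100 * sum(abs(x - y) for x, y in zip(a, b)) / base


true_total = sum(census)  # 2,001,178

fit = {name: round(pct_deviation(census, g, true_total), 2)
       for name, g in (("ours", ours), ("K1", k1), ("K2", k2))}
print(fit)  # {'ours': 6.28, 'K1': 6.09, 'K2': 6.4}

print(round(pct_deviation(ours, k1, true_total), 2))  # 0.64
print(round(pct_deviation(ours, k2, true_total), 2))  # 1.18
print(round(pct_deviation(k2, k1, sum(k1)), 2))       # 1.35
```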

There is evidently a certain imperfection in the elementary method we
have used. If we employed the same method to graduate the numbers at
ages 30 to 39, using the numbers in the three ten-year age-groups 20-, 30-,
40-, there would be a discontinuity at 40, for the two graduated series would
be given by arcs of distinct polynomials. The discontinuity might not
be conspicuous, but it would be there and would probably be brought out
by differencing. To get over this, at least in part, a simple adjustment
can be used. Continue the graduated series for 30 to 39 over the next few
years of age, say to 42. Also continue our series for 40 to 49 backwards to
37. Over the six years 37 to 42 we then have two graduated values at
each age, and these may then be averaged with weights which gradually
throw the weight from the earlier series on to the later, say such simple
weights as 6 to 1, 5 to 2, 4 to 3, 3 to 4, 2 to 5, 1 to 6. We have also paid no
particular attention to the choice of the limits of our ten-year age-group.

Of course it might happen that the numbers were only compiled in
ten-year groups like 20-, 30-, 40-, etc., and then there would be no choice.
But if the figures are given at single years, the choice is at our disposal,
and it may be that we have not chosen wisely. Part of the excess at the
peak figure is probably drawn from lower ages, and it might have been
better to keep the "peak" at the round-number ages well inside the group,
e.g. by compiling totals for the decades 35-, 45-, etc., rather than those
used.

King, in the Census graduation, used five-year age-groups as his
basis, and chose the limits 4-8, 9-13, 14-18, etc., as probably giving the
totals nearest the truth. Taking these five-year totals in successive sets
of three, he used the precise procedure of our Example 24.6 to determine
a graduated figure for the central year of the fifteen, e.g. the three groups
covering ages 4-18 would give a graduated number at age 11, the three
covering ages 9 to 23 would give a graduated number at age 16, and so
on. But here his process broke away. Taking four consecutive graduated
numbers five years apart and determined in this way as "pivotal values,"
he used the method of differences to determine a polynomial of the third
order not passing through the four points $u_0, u_1, u_2, u_3$, but subjected to
the four conditions (1) that it should pass through the two points $u_1$
and $u_2$, (2) that at $u_1$ and $u_2$ it should have a common tangent with the
corresponding arc determined from the next (overlapping) set of pivotal
values. In this way continuity was assured, but equality of observed
and graduated totals for the five-year groups was lost. (The process
used was a simplification of the process of osculatory interpolation, by which
two arcs meeting at a point are given not only a common tangent but also
a common radius of curvature. It might be called "tangential
interpolation.") The desirability of using five-year groups may be questioned.
It is true that ten-year groups are rather large, but the errors that we are
trying to eliminate are definitely functions of the ten final digits, and
however the limits are chosen there is likely to remain a systematic
difference between the adjacent groups of successive pairs if five-year
groups are used.

The test of $K_2$, in which an analogous process was used but based
on the ten-year age-groups 5-14, 15-24, etc., was therefore of interest.
Over the range of 30-80 years the differences between $K_1$ and $K_2$ gave a
smoothly running cyclical curve with a tendency towards a period of
ten years, as might have been expected.

The simple process given in Example 24.7 is applicable throughout
the bulk of life, but not at the two ends of the series, where special tricks
of the trade have to be employed. The difficulty of interpolating in a
"tail," where the numbers are slowly approaching zero, has already been
pointed out. For graduation these difficulties are increased, and it is
often best to drop the method of differences altogether and use some
special process, such as assuming a law of decrease or fitting the tail of a
frequency-distribution.

Inverse interpolation 

24.20 By interpolation we determine the value of the function for a
given value of the variable. If we are given the value of the function
and find the corresponding value of the variable, we are performing
inverse interpolation. The student has carried out the process, in a form
corresponding to simple interpolation, whenever he has determined the
number corresponding to a given logarithm by the use of a table of
logarithms (not a table of antilogarithms). If we need only take first
differences into consideration, the process is, in fact, very simple. From
Newton's formula we have

$$u_x = u_0 + x\,\Delta_0^1$$

whence

$$x = \frac{u_x - u_0}{\Delta_0^1} \qquad (24.11)$$

where $u_0$ will naturally be taken as the tabulated value next below $u_x$.
If we must take second differences also into account, we have

$$u_x = u_0 + x\,\Delta_0^1 + \tfrac{1}{2}x(x-1)\,\Delta_0^2$$

which gives the quadratic for x

$$\tfrac{1}{2}\Delta_0^2\, x^2 + \left(\Delta_0^1 - \tfrac{1}{2}\Delta_0^2\right) x - (u_x - u_0) = 0 \qquad (24.12)$$

or, solving,

$$x = \frac{\Delta_0^2 - 2\Delta_0^1}{2\Delta_0^2} \pm \sqrt{\frac{2(u_x - u_0)}{\Delta_0^2} + \left(\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2}\right)^2} \qquad (24.13)$$


The sign to be taken for the square root will be evident on carrying out
the arithmetic.

This is not always a very convenient expression to use, the solution
(compare Example 24.8 below) being given as a comparatively small
difference between two large quantities. If $x_1$ is the approximate solution
given by first differences, we can replace x in equation (24.12) by $x_1 + h$
and solve for the correction h on the assumption that $h^2$ may be neglected.
This gives

$$h = \frac{x_1(1 - x_1)\,\Delta_0^2}{2x_1\Delta_0^2 + 2\Delta_0^1 - \Delta_0^2} = \frac{x_1(1 - x_1)\,p}{2 + (2x_1 - 1)\,p} \qquad (24.14)$$

where

$$p = \frac{\Delta_0^2}{\Delta_0^1} \qquad (24.15)$$

If we may further assume that p is small, this reduces to

$$h = \tfrac{1}{2}\,x_1(1 - x_1)\,p \qquad (24.16)$$

Obtaining a first approximation from first differences, we can use (24.16)
to get a second approximation, then insert this second approximation in
(24.16) and get a third approximation, and so on until the process of
approximation makes no further difference. But note the assumption
made that p is small.

Example 24.8

To find the approximate value of the quartile deviation, i.e. the value
of $x/\sigma$ for which A = 0.75, in the normal curve, given that for
$x/\sigma$ = 0.6, 0.7, 0.8 the values of A are respectively 0.72575, 0.75804,
0.78814.

The data are:

   x/σ        A           Δ¹           Δ²

   0.6     0.72575     +0.03229     -0.00219

Hence,

$$u_x - u_0 = 0.75 - 0.72575 = 0.02425$$

and the first approximation to x by first differences only is

$$x_1 = \frac{0.02425}{0.03229} = +0.7510 \text{ interval} = +0.07510$$

or measured from the zero of the scale, the first approximation to the
quartile deviation is 0.67510.

Turning now to the quadratic (24.13), the solution is

$$x = 15.2443 - 14.4997 = 0.7446 \text{ interval} = 0.07446$$

the sign of the root having evidently to be taken as negative. Using
second differences, then, our approximation to the quartile deviation is

0.67446

The true value to five places is

0.67449

so the use of second differences only has left an error in the last digit.

Let us see how the suggested process of approximation would have
worked. From (24.16):

$$\begin{aligned}
h &= -0.0339114 \times 0.751 \times 0.249 = -0.00634 \\
x_1 &= 0.751 \\
x_2 &= 0.74466
\end{aligned}$$

Now taking $x_2$ as the second approximation:

$$\begin{aligned}
h &= -0.0339114 \times 0.74466 \times 0.25534 = -0.00645 \\
x_1 &= 0.751 \\
x_3 &= 0.74455
\end{aligned}$$

If we repeat the same process again, $x_4$ = 0.74455, which is the same as
$x_3$, so it is no use going further, and 0.67446 is as close as we can get.
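The three routes to the answer (first differences, the quadratic, and successive approximation) can be set side by side in a short sketch. Variable names are ours; the root of the quadratic is chosen as the one lying nearest the first approximation, which is one way of settling the sign "on carrying out the arithmetic".

```python
from math import sqrt

# Tabulated values of A at x/sigma = 0.6, 0.7, 0.8, and the value sought.
u0, u1, u2 = 0.72575, 0.75804, 0.78814
target = 0.75
origin, interval = 0.6, 0.1

d1 = u1 - u0               # first difference, about +0.03229
d2 = u2 - 2 * u1 + u0      # second difference, about -0.00219

# First approximation, by first differences only (24.11).
x1 = (target - u0) / d1

# Solution of the quadratic (24.12), taking the root nearest x1.
a, b, c = 0.5 * d2, d1 - 0.5 * d2, -(target - u0)
disc = sqrt(b * b - 4 * a * c)
x_quad = min(((-b + disc) / (2 * a), (-b - disc) / (2 * a)),
             key=lambda r: abs(r - x1))

# Successive approximation with (24.16): the correction h is recomputed
# from the latest approximation and always applied to x1.
p = d2 / d1
x = x1
for _ in range(10):
    x = x1 + 0.5 * p * x * (1 - x)

print(round(x1, 4), round(x_quad, 5), round(x, 5))
print(round(origin + interval * x, 4))  # 0.6745
```

Carried to full precision the quadratic root is about 0.74456, agreeing with the limit of the successive approximations; the figure 0.7446 in the text reflects the four-decimal rounding of the two large terms.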
If third and higher orders of difference are brought into account, we
have an equation of higher degree than the second, which can be solved 
by Newton’s method of approximation, but the student will find more 
direct methods given in advanced works. 

Estimation of the position of a maximum

24.21 In this and the following problem an elementary knowledge of
the calculus is assumed; the student who does not know the calculus
may nevertheless find the results useful.

Suppose we are given three equidistant ordinates $u_0$, $u_1$, $u_2$ at 0, 1
and 2. Required to find the position of the maximum of the parabola
passing through the tops of the ordinates. We have

$$u_x = u_0 + x\,\Delta_0^1 + \tfrac{1}{2}x(x-1)\,\Delta_0^2$$

Differentiating with respect to x and equating to zero, the abscissa of the
maximum is given by

$$\Delta_0^1 + \tfrac{1}{2}(2x - 1)\,\Delta_0^2 = 0$$

or

$$x = 0.5 - \frac{\Delta_0^1}{\Delta_0^2} \qquad (24.17)$$

Very often, perhaps most frequently, our data are not ordinates but
rather areas; e.g. if we want to estimate roughly the position of the mode,
our data will be the total frequencies in three successive class-intervals,
not the central ordinates of those intervals. We should then, as in Example
24.5, form the sum of these data step by step and take the second differential
of the polynomial passing through the resultant points in order to
determine the mode. Thus, calling the sum w:

$$\begin{array}{clcl}
x & u & x & \text{Sum } w \\
0 & u_0 & -0.5 & 0 \\
1 & u_0 + \Delta_0^1 & +0.5 & u_0 \\
2 & u_0 + 2\Delta_0^1 + \Delta_0^2 & +1.5 & 2u_0 + \Delta_0^1 \\
 & & +2.5 & 3u_0 + 3\Delta_0^1 + \Delta_0^2
\end{array}$$

It must be remembered that the sum w starts at half an interval below
zero, as shown. Using δ's to denote the differences of w:

$$\delta_0^1 = u_0, \qquad \delta_0^2 = \Delta_0^1, \qquad \delta_0^3 = \Delta_0^2$$

and, equating the second differential of w to zero,

$$\frac{d^2 w_x}{dx^2} = \delta_0^2 + (x - 1)\,\delta_0^3 = 0$$

or

$$x = 1 - \frac{\delta_0^2}{\delta_0^3} = 1 - \frac{\Delta_0^1}{\Delta_0^2}$$

Since x is now measured from -0.5, this is the same answer as before. If
we are concerned only with second differences of the data, and not with
differences of any higher order, it does not matter whether our data are
ordinates or areas.

The method must be used with caution; obviously it cannot give at all
a precise result unless the data run smoothly, and if it be used for
determining the mode, may easily give an answer appreciably divergent from that
obtained by fitting a frequency-curve. The following illustration will serve
as a warning:

Example 24.9. The following are the frequencies near the mode in a
distribution of barometer heights. Estimate the position of the mode, (1)
from the first three, (2) from the last three.

  Height (inches)     29.9      30.0      30.1      30.2

  Frequency          339.5     382.5     395.5     315

Differencing:

   Height      Frequency      Δ¹         Δ²
  (inches)

    29.9         339.5       +43        -30
    30.0         382.5       +13        -93.5
    30.1         395.5       -80.5
    30.2         315

Taking the first three frequencies and their differences:

$$x = 0.5 + \frac{43}{30} = 1.933 \text{ intervals} = 0.193 \text{ inch}$$

Estimated mode = 30.093

Taking the second three frequencies and their differences:

$$x = 0.5 + \frac{13}{93.5} = 0.639 \text{ interval} = 0.064 \text{ inch}$$

Estimated mode = 30.064

Our two answers therefore differ sensibly from each other, and also
from the value given by a fitted Pearson curve, viz. 30.039.
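Formula (24.17) and the two estimates of Example 24.9 may be sketched thus (the function name `parabola_peak` is ours):

```python
def parabola_peak(y0, y1, y2):
    """Abscissa of the turning point of the parabola through three equidistant
    values, measured from the first in units of the interval (24.17)."""
    d1 = y1 - y0
    d2 = y2 - 2 * y1 + y0
    return 0.5 - d1 / d2


heights = [29.9, 30.0, 30.1, 30.2]
freqs = [339.5, 382.5, 395.5, 315.0]
step = 0.1  # class-interval in inches

x_first = parabola_peak(*freqs[:3])    # from the first three frequencies
x_second = parabola_peak(*freqs[1:])   # from the last three frequencies

mode_first = heights[0] + step * x_first
mode_second = heights[1] + step * x_second

print(round(x_first, 3), round(mode_first, 3))    # 1.933 30.093
print(round(x_second, 3), round(mode_second, 3))  # 0.639 30.064
```

The first estimate falls at 1.933 intervals, close to the third class-centre (x = 2) and well away from the middle one, which is itself a sign that three points are being asked to carry more than they can bear.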

Modifying central ordinates to equivalent areas

24.22 Supposing we fit a theoretical frequency-curve to an actual
distribution, and want to determine the "goodness of fit" by the χ²
method. We would usually proceed by calculating, from the curve
determined, the ordinates at the centre of each class-interval and taking
these as the frequencies. But this procedure is not exact, for the central
ordinates are not precise measures of the areas. In a class-interval
centred exactly on the mode, for example, the central (maximum) ordinate
obviously gives too large a value for the area. Required, to obtain some
simple formula for modifying the central ordinates so as to give the areas.

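We have, by Newton's formula,

$u_x = u_0 + x\,\Delta_0^1 + \tfrac{1}{2}(x^2 - x)\,\Delta_0^2$

Integrate this expression for the interval round $u_1$, i.e. between the
limits 0.5 and 1.5, and we will have an expression for the equivalent area,
$w_1$ say:

$w_1 = \int_{0.5}^{1.5} u_x\,dx = \tfrac{1}{24}(u_0 + 22u_1 + u_2)$

$w_1 = u_1 + \tfrac{1}{24}\Delta_0^2$    (24.18)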


The first form of the formula is, in general, the more convenient, but the 
second may be the better if correction is wanted only to a single value of u. 
Example 24.10 

Table 24.5 (page 585) gives in column 2 the calculated ordinates of a 
Pearson curve at the centres of the class-intervals. In columns 3 and 4 
are given the first and second differences, and in column 5 are given 
the corrections Δ²/24, shifted one line down so as to be on the same line
as the ordinate to be corrected. Finally, in column 6 we have the sum 
of the ordinate and the correction, or the area. The totals given at 
the foot are simply for the purpose of checking ; since columns 2 and 3 
both begin and end with zero, the sums of both first and second differences 
must be zero. Since column 5 is derived from column 4 by dividing 
by 24, its sum should also be zero, but errors of rounding off have made 
a very small negative excess. All the corrections are very small ; they 
are necessarily greatest where the curvature is greatest. 
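The correction of Example 24.10 can be reproduced mechanically. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that the ordinates vanish beyond both ends of the table, so that every ordinate has a second difference centred on it.

```python
def ordinates_to_areas(ordinates):
    """Convert central ordinates to class areas by adding (second difference)/24.

    Assumes equal class-intervals and that the curve is zero outside the
    tabulated range (an assumption of this sketch).
    """
    padded = [0.0] + list(ordinates) + [0.0]
    areas = []
    for i in range(1, len(padded) - 1):
        second_diff = padded[i - 1] - 2 * padded[i] + padded[i + 1]
        areas.append(padded[i] + second_diff / 24)
    return areas


# Central ordinates of Table 24.5, class-intervals 0 to 18.
ordinates = [0.00, 0.08, 0.86, 4.72, 15.49, 33.44, 50.84, 57.48, 50.42,
             35.48, 20.60, 10.09, 4.25, 1.56, 0.51, 0.15, 0.04, 0.01, 0.00]
areas = ordinates_to_areas(ordinates)
print(round(areas[7], 2))                           # area for class-interval 7
print(round(sum(areas), 2), round(sum(ordinates), 2))
```

Because the table begins and ends with zero, the corrections sum to zero apart from rounding, which is the check described in the text.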

24.23 A few words in conclusion. The process of interpolation, and 
still more that of graduation, is almost as much artistic as scientific. No 
absolute rules can be laid down, judgment must be used, and it is the 
experienced craftsman who is likely to get the best results with the least 
labour. If the student turns up his Latin dictionary he will find that 
interpolare means not only " to polish up ” (polire, to polish) — so that 
graduation is really the implication of the word — but hence " to corrupt, 
to falsify." It will do him no harm to bear this etymological meaning in 
mind, and keep a look-out accordingly.

TABLE 24.5

    1           2           3           4            5            6
  Class-     Central        Δ¹          Δ²       Correction      Area
 interval    ordinate

    -           -          0.00       +0.08          -            -
    0          0.00       +0.08       +0.70        +0.00         0.00
    1          0.08       +0.78       +3.08        +0.03         0.11
    2          0.86       +3.86       +6.91        +0.13         0.99
    3          4.72      +10.77       +7.18        +0.29         5.01
    4         15.49      +17.95       -0.55        +0.30        15.79
    5         33.44      +17.40      -10.76        -0.02        33.42
    6         50.84       +6.64      -13.70        -0.45        50.39
    7         57.48       -7.06       -7.88        -0.57        56.91
    8         50.42      -14.94       +0.06        -0.33        50.09
    9         35.48      -14.88       +4.37        +0.00        35.48
   10         20.60      -10.51       +4.67        +0.18        20.78
   11         10.09       -5.84       +3.15        +0.19        10.28
   12          4.25       -2.69       +1.64        +0.13         4.38
   13          1.56       -1.05       +0.69        +0.07         1.63
   14          0.51       -0.36       +0.25        +0.03         0.54
   15          0.15       -0.11       +0.08        +0.01         0.16
   16          0.04       -0.03       +0.02        +0.00         0.04
   17          0.01       -0.01       +0.01        +0.00         0.01
   18          0.00        0.00        0.00        +0.00         0.00

 Totals      286.02      +57.48      +32.89        +1.36       286.01
                         -57.48      -32.89        -1.37



SUMMARY 

1. The first, second, third, . . . differences of a function are defined
by the equations

$\Delta_0^1 = u_1 - u_0$

$\Delta_0^2 = \Delta_1^1 - \Delta_0^1$

etc.

the intervals between successive values of the variable x being equal.

2. By means of Newton's formula,

$u_x = u_0 + x\,\Delta_0^1 + \frac{x(x-1)}{2!}\,\Delta_0^2 + \frac{x(x-1)(x-2)}{3!}\,\Delta_0^3 + \dots$

we can interpolate for the value of u for any given value of x.

3. Errors in the values of u become of increasing importance as the
order of the differences increases.

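4. For inverse interpolation

$x = \frac{u_x - u_0}{\Delta_0^1}$

for first differences;

$x = -\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2} \pm \sqrt{\frac{2(u_x - u_0)}{\Delta_0^2} + \left(\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2}\right)^2}$

for second differences.

We can also proceed by successive approximation. If $x_1$ is the approximate
solution by first differences, a closer approximation is $x_1 + h$, where

$h = -\frac{x_1(x_1 - 1)}{2}\,\frac{\Delta_0^2}{\Delta_0^1}$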
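Items 2 and 4 of the summary can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that u is tabulated at equal intervals, with x measured in interval units from the first entry. The inverse routine solves the quadratic obtained by stopping Newton's formula at second differences and keeps the root nearer the first-difference solution.

```python
import math


def leading_differences(u):
    """Return [u_0, first, second, ...] leading differences of the table u."""
    leading, row = [], list(u)
    while row:
        leading.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return leading


def newton_interpolate(u, x):
    """Newton's forward-difference formula; x is in interval units from u[0]."""
    total, coeff = 0.0, 1.0
    for k, diff in enumerate(leading_differences(u)):
        total += coeff * diff
        coeff *= (x - k) / (k + 1)    # builds x(x-1)...(x-k)/(k+1)!
    return total


def inverse_interpolate(u0, u1, u2, target):
    """Solve u_x = target for x using first and second differences only."""
    d1 = u1 - u0
    d2 = u2 - 2 * u1 + u0
    if d2 == 0:
        return (target - u0) / d1             # first differences suffice
    b = (2 * d1 - d2) / (2 * d2)
    disc = 2 * (target - u0) / d2 + b * b
    if disc < 0:
        raise ValueError("no real solution with second differences")
    roots = (-b + math.sqrt(disc), -b - math.sqrt(disc))
    x1 = (target - u0) / d1 if d1 != 0 else 1.0
    return min(roots, key=lambda r: abs(r - x1))


# Check on a function whose differences are known exactly: u = x squared.
squares = [0, 1, 4, 9]
print(newton_interpolate(squares, 1.5))
print(inverse_interpolate(0, 1, 4, 2.25))
```

For a quadratic such as this the second-difference formulae are exact; for tabulated functions in general they are only approximations, and the order of differences retained governs the accuracy.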


EXERCISES 

24.1 Given the following values for the normal integral

x/σ       P

1.4     0.91924
1.5     0.93319
1.6     0.94520
1.7     0.95543

find the value of P for x/σ = 1.54, noting the successive approximations
up to third differences. Take $u_0$ at 1.4.

24.2 Find as closely as possible the value of P for χ² = 11.7 from the
following entries in the χ² table (Tables for Statisticians): ν = 17 (n′ = 18).
Note the successive approximations and the number of places to which
your final answer is probably trustworthy.

χ²        P

10     0.903610
11     0.856564
12     0.800136
13     0.736186

24.3 From the following entries in the same table for ν = 24 (n′ = 25),
estimate as closely as you can the value of P for χ² = 43. Similarly,
estimate the closeness of your approximation.

χ²        P

30     0.184752
40     0.021387
50     0.001416
60     0.000064


24.4 The following table gives the deaths of males registered in
England and Wales during the three years 1930, 1931, 1932, at the ages
stated. The figures on the right give the totals of the quinquennial groups
which were, on this occasion, held to give the best totals for determining
quinquennial "pivotal values." Find graduated numbers for the ages
40 to 44 inclusive.

Age     Numbers     Quinquennial totals

35       3394
36       3505
37       3501
38       3947
39       3998            18,345
40       4220
41       4281
42       5024
43       4993
44       5260            23,778
45       5998
46       6113
47       6463
48       6921
49       7663            33,158


24.5 Let $u_0, u_1, u_2, \dots, u_{14}$ be the numbers in fifteen consecutive years of
age, as in Exercise 24.4, and $w_0, w_1, w_2$ the totals in the three quinquennial
groups. Show that if we want only the graduated figure for $u_7$ as a
"pivotal value," this may be written down at once from the equation

$u_7 = 0.2\,w_1 - 0.008\,\Delta^2 w_0$

(King's formula). Verify by comparison with your answer to Exercise 24.4.

24.6 Generalising the above result, show that if $w_0, w_1, w_2$ are three
successive age-groups of r years each, we have for the graduated central
value

$u_{(3r-1)/2} = \frac{w_1}{r} - \frac{r^2 - 1}{24r^2}\left(\frac{\Delta^2 w_0}{r}\right)$

and hence if r become indefinitely great, the central ordinate of the middle
group of three, with areas $w_0, w_1, w_2$ and common base c, is given by

$y = \frac{w_1}{c} - \frac{1}{24}\left(\frac{\Delta^2 w_0}{c}\right)$

Verify by finding approximately the central ordinate of the normal curve
from the areas between -0.3 and -0.1, -0.1 and +0.1, +0.1 and
+0.3 x/σ.

24.7 From the following (abbreviated) entries in the χ² table, ν = 9
(n′ = 10), estimate the value of χ² for which P = 0.25:

χ²       P

11     0.2757
12     0.2133
13     0.1626


24.8 The next table shows a frequency-distribution of 1,000 observations,
and also gives the frequencies summed from the top. Estimate (1) the
median, (2) the first decile, (3) the ninth decile, (a) as usual by simple
interpolation, (b) by bringing second differences also into account.

Interval     Frequency      x      Sum of frequencies
                                      from 0 to x

  0-1            28         1            28
  1-2            76         2           104
  2-3           114         3           218
  3-4           141         4           359
  4-5           158         5           517
  5-6           142         6           659
  6-7           119         7           778
  7-8            95         8           873
  8-9            63         9           936
  9-10           33        10           969
 10-11           18        11           987
 11-12            8        12           995
 12-13            2        13           997
 13-14            2        14           999
 14-15            -        15           999
 15-16            1        16          1000

 Total         1000         -             -

24.9 The following are the mean temperatures (Fahrenheit) at Greenwich
on three days 30 days apart round the periods of summer maximum and
winter minimum. Estimate the approximate dates and values of the
maximum and minimum.

Day     Date          Temp.     Date          Temp.

  0     15th June     58.8      16th Dec.     40.7
 30     15th July     63.4      15th Jan.     38.1
 60     14th Aug.     62.5      14th Feb.     39.3


24.10 Taking the value of the central ordinate of the normal curve from
Appendix Table 1, estimate the area between the limits ±0.1 x/σ, and
verify your answer from the area table.

The general problem

25.1 It often happens, particularly in economic statistics, that a set of
similar events moving through time or space gives rise to some general
concept expressing variation in their common element. The prices of a
number of commodities on sale lead to the notion of a relative "price
level"; the various outputs of manufacturing plants generate the idea
of changes in the "volume of industrial production" as a thing-in-itself;
the yields of different crops in a set of agricultural districts suggest a
comparison of "agricultural productivity" between different geographical
areas. Although there is room for argument about the role of some of
these concepts in providing explanations of phenomena, it will not in
general be denied that they are useful subjects of inquiry, and in particular
that knowledge is advanced when we can measure the properties which
they represent, or at least the relative values at different times and in
different places. In fact, when we leave the domain of philosophical
discussion some of these concepts assume a degree of practical importance
which is denied to more concrete and less contentious ideas; whether we
agree or not that there is such a thing as the cost-of-living, we must
admit that movements in wages and salaries in many countries are
influenced (and in some are determined) by a measure of the relative
level of the "cost-of-living" expressed in the most definite numerical
terms.


25.2 In this chapter we shall be concerned with the measurement of such
concepts as relative price-levels and changes in the general price-level
by means of index-numbers, i.e. numbers which tell us, or at least purport
to tell us, that if the price-level in such and such a year be denoted by
100 it is now 127 (or thereabouts); or that, if the cost-of-living of the
working classes in London be denoted by 100, in this or that provincial
town it is no more than 85 or 90. There are many different types of such
quantities and it is not easy to frame a short definition to cover them all
which shall be both precise and intelligible. In the majority of cases the
index-numbers are calculated over a series of months or years and attention
is directed to their variation in time, but comparisons also fall to be made
in space, as in the case of cost of living in different towns above, or, to take
illustrations from other fields, if we wish to compare standardised
birth-rates in different countries or shipping freight-rates in different
sea-routes. From the elementary view-point, it is perhaps easiest to regard
an index-number as a measure of central tendency in a group of items; and
many of the index-numbers in common use are nothing more than weighted
averages of relative numbers for the several component items of the
concept in question.

25.3 Table 25.1 shows, in column (2) the average annual price of English 
wheat, as recorded in the Official Gazette, for the years 1930-1945 inclusive. 
In column (3) we show these prices expressed as a percentage of the price 
in 1930, and in column (4) the prices are similarly expressed as a percentage 
of the price in 1945. 

TABLE 25.1. Prices of English wheat

 (1)          (2)                (3)                  (4)
 Year        Price          Column (2) as        Column (2) as
          (per quarter)     percentage of        percentage of
            s.   d.          1930 price           1945 price

 1930       34    3             100                   55
 1931       24    0              70                   39
 1932       25    0              73                   40
 1933       22   10              67                   37
 1934       20    2              59                   33
 1935       22    2              65                   36
 1936       30    9              90                   50
 1937       40    0             117                   65
 1938       28   11              84                   47
 1939       21    5              63                   35
 1940       42   10             125                   69
 1941       62   10             183                  102
 1942       68    6             200                  111
 1943       69    8             203                  113
 1944       63   11             187                  103
 1945       61   10             181                  100

The figures in columns (3) and (4) are very simple cases of index-numbers. 
The eye cannot very easily follow the variations in price by running down 
column (2), particularly if it is desired to gauge the magnitude of the 
variation through time. By expressing the figures with reference to the 
basic number of 100 we are, effectively, reducing the data to a convenient 
common scale. Such figures are usually called " price-relatives." 
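The conversion to price-relatives is a matter of simple proportion, and may be sketched in Python as below. The names are hypothetical; prices are first reduced to pence (12 pence to the shilling), and only a few years of Table 25.1 are included for brevity.

```python
def pence(shillings, pennies):
    """Reduce a price in shillings and pence to pence."""
    return 12 * shillings + pennies


# A few of the wheat prices of Table 25.1 (per quarter).
WHEAT = {
    1930: pence(34, 3),
    1931: pence(24, 0),
    1942: pence(68, 6),
    1945: pence(61, 10),
}


def price_relatives(prices, base_year):
    """Express each price as a percentage of the base-year price."""
    base = prices[base_year]
    return {year: round(100 * price / base) for year, price in prices.items()}


on_1930 = price_relatives(WHEAT, 1930)
on_1945 = price_relatives(WHEAT, 1945)
print(on_1930)
print(on_1945)
```

The two sets of relatives are proportional to one another and to the original prices, apart from rounding.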

25.4 Simple as this example is, it brings out several points of practical
importance which are apt to be overlooked in dealing with the theoretical
problems arising from more complicated types of index numbers.

(a) Arithmetically the series of columns (2), (3) and (4) are equivalent
in the sense that they are proportional. Nevertheless, they may not
convey the same impression, particularly to the lay reader. It is very
natural to take the basic figure of 100, not as a purely convenient
arithmetical quantity, but as some "norm" or "standard" of what ought
to be. In our example, to say that the price in 1942 was twice as great
as in 1930 may convey a different impression from saying that the price 
in 1930 was half that of 1942 ; in the first case we are taking the earlier 
year as the standard of comparison, in the second case the later year. A 
consumer of bread would probably incline to the former, an arable farmer 
to the latter. This kind of point becomes of special importance for 
economic index-numbers (such as those of wages or cost-of-living) which 
are likely to be the subject of controversy. It must always be remembered 
that the choice of the base-year may have to be exercised on grounds 
other than those of convenience to the statistician, or those which might 
appear to him of most importance. 

(b) It is common practice to refer to changes in an index-number from 
one year to another as a movement of so many " points ”, e.g. the index 
in column (3) of Table 25.1 fell by 6 points between 1932 and 1933. There 
is no great objection to this phraseology if the basic year is borne in 
mind, but it is apt to provide a misleading picture of the importance of 
the movement. The index also fell 6 points between 1944 and 1945 but 
clearly the relative fall in the second case was smaller than in the first 
(in fact, only about a third as great). 
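The distinction drawn in (b) between a movement in "points" and the relative movement can be checked numerically. The short Python sketch below uses the column (3) figures of Table 25.1; the function name is hypothetical.

```python
def point_and_percent_change(old, new):
    """Return the change in index points and the percentage change."""
    points = new - old
    percent = 100 * points / old
    return points, percent


early = point_and_percent_change(73, 67)     # 1932 to 1933
late = point_and_percent_change(187, 181)    # 1944 to 1945
print(early, late)
print(late[1] / early[1])                    # ratio of the two relative falls
```

Both movements are falls of 6 points, but the second is a fall of a little over 3 per cent against rather more than 8 per cent for the first.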

Price index-numbers 

25.5 To fix the ideas, let us suppose that we require to construct an
index for the United Kingdom of wholesale prices over a series of years.
We shall first of all have to decide what commodities are to be covered by
the index and how to collect the prices. This leads to a number of practical
points which are apt to be troublesome (e.g. how to pick a representative
set of commodities, how to treat imported articles, and how to deal with
missing price-quotations) but which we shall pass over as not offering
any special theoretical problems. We will suppose that we have m
commodities whose prices in the jth year are typified by $p_{1j}, p_{2j}, \dots, p_{mj}$.
These are heterogeneous quantities, each of them representing, it is true,
"money per quantity" (for that is what is meant by a price) but the
quantity in terms of which the price is stated varying from commodity
to commodity. For pig-iron it is, say, a ton; for raw cotton it is also a
weight, but the weight is only a pound, and for a precious metal only
an ounce; for beer or wine it is not a weight at all but a volume; for
woven textiles perhaps a "piece", and so on. In order to apply any of
the conceptions or methods of previous chapters, e.g. frequency
distributions, averages, measures of dispersion, etc., we require a homogeneous
set of quantities all of the same dimensions; as stated in 5.4 an average
"is merely a certain value of the variable, and is therefore necessarily of
the same dimensions as the variable" so that if the data are of differing
dimensions the average has no assignable meaning. As an initial step,
therefore, we want to convert the heterogeneous figures of our price-table
into a homogeneous set of figures all of the same dimensions. This can
be done in more ways than one, but the simplest is to apply to each
column of our prices-table the process used in Table 25.1, i.e. to convert
the given prices into price-relatives. As these are simple ratios, they are
all pure numbers. Table 25.2 illustrates the procedure; Col. 2 repeats
the wheat prices of Table 25.1 and in Cols. 3 and 4 are added the Gazette-
prices of Barley and Oats. In Cols. 5, 6, and 7 these prices are converted
into price-relatives with 1930 as base-year.

TABLE 25.2. Prices and price-relatives of wheat, barley and oats

 (1)             Price per quarter                 Price relative (1930 = 100)
 Year      (2)          (3)          (4)            (5)       (6)       (7)
          Wheat        Barley        Oats          Wheat     Barley     Oats
          s.  d.       s.  d.       s.  d.

 1930     34   3       28   3       17   2          100       100       100
 1931     24   0       28           17   8           70        99       103
 1932     25   0       27   1       19   3           73        96       113
 1933     22  10       28   7       15  10           67       101        92
 1934     20   2       30  11       17   5           59       109       101
 1935     22   2       28   7       18   9           65       101       109
 1936     30   9       29   5       17   8           90       104       103
 1937     40   0       39           23  11          117       138       139
 1938     28  11       36   4       21   2           84       129       123
 1939     21   5       31   7       19   3           63       112       113
 1940     42  10       64  10       37   2          125       229       217
 1941     62  10       85   0       40  10          183       303       238
 1942     68   6      165   5       42   0          200       586       245
 1943     69   8      113   5       43   8          203       398       254
 1944     63  11       94   6       45   3          187       335       264
 1945     61  10       89   2       45   9          181       316       267

25.6 In terms of our symbols then, we replace each price $p_{rj}$ by a
price-relative $\rho_{rj}$ where, ignoring the factor of 100,

$\rho_{rj} = p_{rj}/p_{r0}$    (25.1)

$p_{r0}$ being the price of commodity r in the standard year (or, to put it more
generally, the standard price of commodity r, for prices in a single year
are subject to casual disturbances and it may be better to take as standard
the average price over a five or ten year period). We may now average
the relatives (25.1) in any way we please in order to obtain our desired
index-number for the "relative general level of prices". If we take the
simple arithmetic mean of the ρ's, we have, using ${}_1I_j$ to denote this form
of index-number for the year j and Σ to denote summation for all
commodities r

${}_1I_j = \frac{1}{m}\,\Sigma(\rho_{rj})$    (25.2)

For instance, in the data of Table 25.2, ${}_1I_{1945} = \tfrac{1}{3}(181 + 316 + 267) = 255$.

This formula, however, attaches precisely the same "weight" to each
commodity whether little is sold at the specified price or much; a
commodity such as wheat is given no more weight than a commodity such
as pepper, in spite of the enormous difference between the quantities
moving into consumption. We have therefore to consider whether some
system of weights can be introduced to allow for this effect.
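The unweighted form (25.2) is a plain average of the price-relatives, as the following Python sketch shows for the 1945 figures of Table 25.2. The function name is hypothetical.

```python
def simple_mean_index(relatives):
    """Unweighted arithmetic mean of a set of price-relatives, as in (25.2)."""
    return sum(relatives) / len(relatives)


# 1945 relatives (1930 = 100) for wheat, barley and oats from Table 25.2.
relatives_1945 = [181, 316, 267]
print(round(simple_mean_index(relatives_1945)))
```

Each commodity counts equally here, whatever the quantity sold, which is precisely the objection raised in the text.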


25.7 Since our price-relatives are all of the same dimensions (pure
numbers) our weights should also be all of the same dimensions. They
cannot therefore be quantities, for some of the quantities are actual weights,
some volumes, and so on. Suppose then we make the weight for each
price-relative the money spent on that particular commodity in the base
year (or the average annual amount in the base period), say $p_{r0}\,q_{r0}$, where
$q_{r0}$ is the quantity in question. Then for the form ${}_2I_j$ of the desired
index-number we have

${}_2I_j = \frac{\Sigma(\rho_{rj}\,p_{r0}\,q_{r0})}{\Sigma(p_{r0}\,q_{r0})}$    (25.3)

$\phantom{{}_2I_j} = \frac{\Sigma(p_{rj}\,q_{r0})}{\Sigma(p_{r0}\,q_{r0})}$    (25.4)

This is a remarkable result, for (25.4) is simply the ratio of the cost of
the given "basket of goods" (the quantities sold in the base year or on
an average in the base period) at the prices of year j to its cost at the prices
of the standard year or period. Looking at the matter in another way,
we have converted our heterogeneous price-figures, as required, into
homogeneous figures by multiplying each price by a quantity expressed
in the same units as are used in specifying the price and thus turning them
from "money per quantity" into "money."
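The identity of (25.3) and (25.4) is easily verified numerically. In the Python sketch below the prices and quantities are purely hypothetical figures chosen for illustration, and the function names are likewise assumptions of the sketch.

```python
def weighted_relative_index(p_base, p_cur, q_base):
    """Mean of price-relatives weighted by base-year expenditure, as in (25.3)."""
    weights = [p * q for p, q in zip(p_base, q_base)]
    relatives = [pc / pb for pc, pb in zip(p_cur, p_base)]
    return sum(r * w for r, w in zip(relatives, weights)) / sum(weights)


def basket_cost_index(p_base, p_cur, q_base):
    """Cost of the base-year basket at current prices over its base cost, as in (25.4)."""
    current_cost = sum(pc * q for pc, q in zip(p_cur, q_base))
    base_cost = sum(pb * q for pb, q in zip(p_base, q_base))
    return current_cost / base_cost


# Hypothetical data: three commodities priced in different units.
p_base = [10.0, 4.0, 2.0]     # base-year prices
p_cur = [12.0, 5.0, 2.0]      # prices in year j
q_base = [3.0, 10.0, 20.0]    # base-year quantities, in matching units

print(weighted_relative_index(p_base, p_cur, q_base))
print(basket_cost_index(p_base, p_cur, q_base))
```

The two functions agree because the base price in each weight cancels against the denominator of the corresponding relative.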

25.8 Although the index-number ${}_2I_j$ has a fairly intelligible meaning, it
is still open to some objections. In fact, it depends on the quantities sold
in the basic year, and if the actual quantities vary substantially from
year to year there is some ground for arguing that such a fact ought to
be taken into account. For example, if over a period the proportion of the
average household income spent on food drops from 40 per cent in the
basic year to 25 per cent, it seems obviously wrong to continue to weight
food-prices by a factor of 40 per cent. Our weights, so to speak, ought
in some sense to be kept up-to-date.

25.9 Before discussing this problem in generality, let us make four 
preliminary observations — 

(a) We noted in 14.15 that errors in weights, if uncorrelated with the
prices to which they are attached, will not exert much effect on the index
numbers. Thus, if the weights change rather erratically or by small
amounts from year to year, the accuracy of the index is not seriously
affected. For practical purposes, therefore, ${}_2I_j$ should give a reasonable
comparison between years which are not far apart in time and may be
satisfactory over quite a long period unless there is some systematic
movement in weights during that period.

(b) Purely practical difficulties in determining weights from year to
year may make some formula of the type (25.4) the only one which can be
calculated in time to be of any value.

(c) It is arguable on theoretical grounds that ${}_2I_j$ (or some similar
formula based on a different type of average) is the correct form to use
in estimating price changes. If we make allowance for changing quantities
we may be confusing price change with other things. For instance, an
index of the form $\Sigma(p_{rj}\,q_{rj})/\Sigma(p_{r0}\,q_{r0})$ measures the ratio of the total
expenditure in the jth year to that in the basic year, and to that extent
is a definite measurable quantity. But when we try to dissect that part
of it which is due to price change from the part due to change in quantity,
we are in difficulties, for so far as observation goes the two things are
really inextricable parts of the same phenomenon. There is, in fact, an
element of convention in our definition of a price index-number. The
statistician will always remember how his index-number is calculated and
will know how far he can use it in any particular argument. If he chooses
to define his price-index by reference to the "fixed basket of goods" he
is perfectly entitled to do so. He may perhaps be challenged on the
grounds that his price-index does not possess some desirable properties
which might be expected of a perfect price-index. He cannot fairly be
accused of doing anything wrong; only of doing something inexpedient.

25.10 We have attempted to simplify the discussion by speaking of prices 
and years in the construction of our index-numbers. Evidently similar 
considerations apply when the periods of comparison are not years, but 
some other unit of time, except that for short periods, such as months, 
we may have to pay some attention to seasonal effects. Most of what we 
are saying about price-indices also applies to other forms of index-numbers 
although there are certain features of prices which give rise to special 
difficulties. Broadly speaking, the theory of price-indices covers the 
general case, and indeed other index-numbers are frequently much easier 
to construct when they can be freed from measurement in terms of money. 
We shall refer to the so-called quantum indices below (25.20). In the 
meantime, we continue to discuss price indices on the understanding that 
our discussion has a somewhat wider application. 

Geometric means 

25.11 The same kind of considerations which led us in Chapter 6 to
express a preference for the arithmetic mean in determining averages
also apply to its use for index numbers, except perhaps that the argument
from sampling simplicity is not so strong. The use of medians and modes
is to be deprecated and only the student of statistical history is likely
to encounter them in connection with index-numbers. There is, however,
something to be said in favour of the geometric mean, particularly in
connection with price indices. Let us note the formulae corresponding to
${}_1I_j$ and ${}_2I_j$.

For the index based on the geometric mean of a set of price-relatives
we have

${}_3I_j = \left(\frac{p_{1j}\,p_{2j}\cdots p_{mj}}{p_{10}\,p_{20}\cdots p_{m0}}\right)^{1/m}$    (25.5)

where m is the number of prices concerned. For purposes of calculation
this is more easily written as

$\log {}_3I_j = \frac{1}{m}\{\Sigma(\log p_{rj}) - \Sigma(\log p_{r0})\}$    (25.6)

Clearly (25.6) can also be written as

$\log {}_3I_j = \frac{1}{m}\,\Sigma(\log p_{rj}/p_{r0})$    (25.7)

It makes no difference whether we take the ratio of the geometric means
or the geometric mean of the ratios.

For the corresponding index to ${}_2I_j$ we have

${}_4I_j = \left\{\prod_r \left(\frac{p_{rj}}{p_{r0}}\right)^{q_{r0}}\right\}^{1/\Sigma(q_{r0})}$

which is more conveniently written

$\log {}_4I_j = \frac{\Sigma\{q_{r0}\,\log(p_{rj}/p_{r0})\}}{\Sigma(q_{r0})}$    (25.8)
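The logarithmic forms lend themselves to direct computation. The Python sketch below is illustrative: the function names are hypothetical, prices are taken from the 1930 and 1945 rows of Table 25.2 reduced to pence, and the weighted form simply accepts whatever weights are supplied.

```python
import math


def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geometric_index(p_base, p_cur, weights=None):
    """Geometric-mean index (per 100); equal weights if none are given."""
    if weights is None:
        weights = [1.0] * len(p_base)
    log_sum = sum(w * math.log(pc / pb)
                  for w, pb, pc in zip(weights, p_base, p_cur))
    return 100 * math.exp(log_sum / sum(weights))


# Wheat, barley and oats, 1930 and 1945, in pence per quarter (Table 25.2).
p_1930 = [411, 339, 206]
p_1945 = [742, 1070, 549]

ratio_of_means = 100 * geometric_mean(p_1945) / geometric_mean(p_1930)
mean_of_ratios = geometric_index(p_1930, p_1945)
print(ratio_of_means, mean_of_ratios)

# Geometric mean of the rounded 1945 relatives, for comparison with 255.
print(geometric_mean([181, 316, 267]))
```

The first two figures coincide, illustrating that the ratio of the geometric means equals the geometric mean of the ratios; the geometric mean of the relatives falls somewhat below their arithmetic mean.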

Example 25.1. As an example of a price index-number calculated from
the arithmetic mean by reference to a fixed set of weights in a basic
period, we consider the British official "interim index of retail prices."
This used to be known as the "cost-of-living index," a term which the
authorities are attempting to abandon in favour of a more neutral type
of wording. A better phrase would be "household budget price-index",
since the object of the index is to measure changes in the average retail
prices of the items composing the expenditure in an average household
budget. The two main practical questions for decision in constructing
the index are: what commodities are concerned and what is their relative
importance in the "average budget"?

For the index-number, which was first published in 1947, the Ministry
of Labour used data collected in 1937/8 by sampling about 10,000
household budgets. The information gave, in considerable detail, the
expenditure on all items for four separate weeks in October 1937, January 1938,
April 1938 and July 1938, and an arithmetic average of the four was
regarded as representative of the proportionate expenditure on each item
over the year. Some of the budgets were collected from agricultural
households and were separated for the construction of an index relating
to agricultural workers.

There are about 90 items involved and they are classified into eight
groups:

1. Food 

2. Rent and rates 

3. Clothing 

4. Fuel and light 

5. Household durable goods 

6. Miscellaneous goods 

7. Services 

8. Drink and tobacco. 

Current prices for the 90 items are collected by the Ministry of Labour 
from various sources, e.g. by visits of local officers to retailers in regard 
to food or by inquiries of local authorities and property owners' associations 
in regard to rent. These prices are related to the corresponding figures 
for the basic date, namely, 17th June 1947, taken as 100. 

It then remains to compound these price relatives into an index for each 
group, and finally, to compound the eight resultant indices into a single 
index. The same principles are employed in each case and effectively 
amount to the use of equation (25.3). They may be exemplified by the 
method of constructing the final index from the eight component indices. 

In calculating the final index, a weighted arithmetic mean is taken of
the components, the weights used being as follows:

Food . . . . . . . . . . . .    348
Rent and rates . . . . . . .     88
Clothing . . . . . . . . . .     97
Fuel and light . . . . . . .     65
Household durable goods  . .     71
Miscellaneous goods  . . . .     35
Services . . . . . . . . . .     79
Drink and tobacco  . . . . .    217

Total                         1,000

Thus for instance, the index numbers of the eight groups in mid-December
1947 were respectively 103.4, 100.1, 102.4, 107.1, 106.3, 109.2, 102.5,
104.1. Taking our origin as 100 we have for the index for "all items"

$100 + \{(348 \times 3.4) + (88 \times 0.1) + \text{etc.} + (217 \times 4.1)\}/1{,}000 = 103.7$

The weights in this case are an attempt to represent the proportional
expenditure in 1947 on the eight groups, e.g. it is estimated that 34.8 per
cent of household expenditure was devoted to food. As no definite
information for 1947 was available, the proportions shown in the budget
inquiry of 1937/8 were adjusted to take account of changes in price
between 1937/8 and mid-June 1947. The proportion attributable to
drink and tobacco was scaled up to take account of 1947 conditions.
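The "all items" figure of Example 25.1 can be reproduced with a short Python sketch. The weights and group indices are those quoted in the example; the function and variable names are hypothetical.

```python
# Group weights (per 1,000) and mid-December 1947 group indices, Example 25.1.
WEIGHTS = {
    "Food": 348, "Rent and rates": 88, "Clothing": 97, "Fuel and light": 65,
    "Household durable goods": 71, "Miscellaneous goods": 35,
    "Services": 79, "Drink and tobacco": 217,
}
GROUP_INDICES = {
    "Food": 103.4, "Rent and rates": 100.1, "Clothing": 102.4,
    "Fuel and light": 107.1, "Household durable goods": 106.3,
    "Miscellaneous goods": 109.2, "Services": 102.5,
    "Drink and tobacco": 104.1,
}


def all_items_index(weights, indices, origin=100.0):
    """Weighted arithmetic mean of group indices, worked from an origin of 100."""
    total_weight = sum(weights.values())
    excess = sum(weights[g] * (indices[g] - origin) for g in weights)
    return origin + excess / total_weight


print(round(all_items_index(WEIGHTS, GROUP_INDICES), 1))
```

Working from an origin of 100 merely shortens the arithmetic; a direct weighted mean of the eight indices gives the same figure.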

Example 25.2. An illustration of a price-index calculated by the use
of geometric means with a fixed set of weights is provided by the British
index-number of wholesale prices. This Index purports to measure the
movement in the prices of wholesale commodities. It was revised in
1935 on the basis of information obtained from the Census of Production
of 1930.

There are 200 commodities composing the Index, the numbers, in
eleven groups, being as follows:

Group                              Number of
                                  Commodities

Cereals  . . . . . . . . . .          20
Meat, fish and eggs  . . . .          20
Other food and tobacco . . .          28

Total: Food and tobacco  . .          68

Coal . . . . . . . . . . . .           9
Iron and steel . . . . . . .          37
Non-ferrous metals . . . . .           8
Cotton . . . . . . . . . . .          10
Wool . . . . . . . . . . . .          11
Other textiles . . . . . . .           9
Chemicals and oils . . . . .          15
Miscellaneous  . . . . . . .          33

Total: Industrial items  . .         132

Total: All articles  . . . .         200


These numbers, which are effectively weights for the groups concerned, 
are based approximately on the relative importance of the various items 
as indicated by the production figures in the 1930 census and imports 
in that year, importance for this purpose being measured by the value 
of the gross output. 

Prices are obtained from various sources, mostly from trade publications, 
and relate to certain standard types or specifications. In some cases the 
prices of two or more qualities are averaged for a particular commodity 
so as to give a wider coverage. 

In the construction of the Index for any particular commodity the
price is recorded weekly where possible and an arithmetical average of
the weekly quotations provides a figure for the month. This is then
related to the price in the corresponding month of the basic year by means
of a simple price-relative. A composite index is then constructed for the
month for each of the groups specified in the above table by taking a
geometric mean (the actual arithmetic process is somewhat different, but
this is what it amounts to).

The monthly index for "All Articles" is obtained as a geometric mean
of the price-relatives for the 200 items in the above table. This is equivalent
to taking a weighted average of the eleven groups with weights given by
the number of commodities listed above. An annual index is constructed
by taking the geometric average of the index numbers for the twelve
months.
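The equivalence stated for the "All Articles" index (a geometric mean over all items equals a weighted geometric mean of the group indices, the weights being the numbers of items) can be checked with a small Python sketch. The group names echo the table, but the price-relatives are hypothetical figures invented for illustration, as are the function names.

```python
import math


def geometric_mean(values, weights=None):
    """Geometric mean, optionally weighted, computed through logarithms."""
    if weights is None:
        weights = [1.0] * len(values)
    log_sum = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(log_sum / sum(weights))


# Hypothetical price-relatives for a few items in three groups.
groups = {
    "Cereals": [104.0, 98.0, 110.0],
    "Coal": [120.0, 125.0],
    "Wool": [90.0],
}

all_items = [r for relatives in groups.values() for r in relatives]
direct = geometric_mean(all_items)

group_indices = [geometric_mean(relatives) for relatives in groups.values()]
counts = [len(relatives) for relatives in groups.values()]
via_groups = geometric_mean(group_indices, counts)

print(direct, via_groups)
```

The agreement is exact in principle, since the logarithm of each group index multiplied by its count restores the sum of the logarithms of its items.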

It will be noticed that in neither of the two examples we have just 
given — two of the most important industrial indices in the United 
Kingdom — are the weighting factors actual quantities. For the budgetary 
index they are based on proportional expenditure in a standard period, 
for the wholesale price index they are based on value of gross output in 
a standard period. 

The time-reversal test 

25.12 Let us now consider generally some of the properties which we 
should like to have in an index number. We will not dwell on properties 
such as ease of calculation, but will discuss some desiderata which arise 
from our general notion of the functions which an index number ought 
to perform. 

In discussing the price-relative of Table 25.1, we noted that the series
of columns (3) and (4) were equivalent in the sense of being proportional.
The difference in the base year makes no difference to the index numbers
except one of scale. To put it slightly differently, the relative of year a
based on year b, say $k_{ab}$, is the reciprocal of the index of year b based on
year a, say $k_{ba}$ (except for the factor of 100 which we may ignore for
present purposes). That is to say

$$k_{ab}\,k_{ba} = 1$$

The price relative therefore obeys what we may call a time-reversal 
test ; and this is clearly a property which we should welcome in any 
index number, for then our comparison between two years does not 
depend on which year we regard as the base year. That is, we should like 
an index number to obey the relations 

$$I_{ab}\,I_{ba} = 1 \qquad (25.9)$$

Of the four indices we have considered earlier in the chapter only one
obeys the time-reversal test, namely the simple geometric index. When we
introduce weights appropriate to a fixed base-year the time-reversal
property is destroyed. For instance, with the index ${}_qI$ weighted by
base-year quantities we have

$${}_qI_{ab} = \Sigma(p_{ra}\,q_{rb}) \,/\, \Sigma(p_{rb}\,q_{rb})$$
$${}_qI_{ba} = \Sigma(p_{rb}\,q_{ra}) \,/\, \Sigma(p_{ra}\,q_{ra})$$

and equation (25.9) is not obeyed unless the $q_{ra}$'s are equal or proportional
to the $q_{rb}$'s.
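A small numerical check may make the point concrete. The prices and quantities below are hypothetical, and the function names are ours; the sketch compares the simple geometric index with the aggregative index weighted by the quantities of whichever year serves as base.

```python
import math

def geometric_index(p_num, p_den):
    """Simple geometric mean of the price-relatives p_num/p_den (factor 100 omitted)."""
    n = len(p_num)
    return math.exp(sum(math.log(x / y) for x, y in zip(p_num, p_den)) / n)

def base_weighted_index(p_num, p_den, q_base):
    """Aggregative index sum(p_num * q_base) / sum(p_den * q_base)."""
    return (sum(x * q for x, q in zip(p_num, q_base))
            / sum(y * q for y, q in zip(p_den, q_base)))

# Hypothetical prices and quantities for three commodities in years a and b.
p_a, q_a = [12.0, 8.0, 5.0], [10.0, 30.0, 20.0]
p_b, q_b = [10.0, 10.0, 4.0], [20.0, 15.0, 20.0]

# Product I_ab * I_ba for each type of index; the test requires unity.
geometric_product = geometric_index(p_a, p_b) * geometric_index(p_b, p_a)
weighted_product = (base_weighted_index(p_a, p_b, q_b)
                    * base_weighted_index(p_b, p_a, q_a))

print(geometric_product, weighted_product)
```

With these figures the geometric product is unity to rounding error, whereas the weighted product is 480/430, i.e. about 1.116.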
Nevertheless the test may be approximately obeyed if the changes in
weights from $q_{rb}$ to $q_{ra}$ are small or if they are not highly correlated with
the prices. For let

$$q_{ra} = q_{rb} + \delta_r$$

where $\delta_r$ is small. Then

$${}_qI_{ab}\;{}_qI_{ba} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})} \cdot \frac{\Sigma\{p_{rb}(q_{rb}+\delta_r)\}}{\Sigma\{p_{ra}(q_{rb}+\delta_r)\}} = \frac{1 + \Sigma(p_{rb}\delta_r)/\Sigma(p_{rb}q_{rb})}{1 + \Sigma(p_{ra}\delta_r)/\Sigma(p_{ra}q_{rb})}$$

which is approximately

$$1 + \frac{\Sigma(p_{rb}\delta_r)}{\Sigma(p_{rb}q_{rb})} - \frac{\Sigma(p_{ra}\delta_r)}{\Sigma(p_{ra}q_{rb})} \qquad (25.10)$$


As the quantities $\delta_r$ are small the two terms on the right in (25.10) will
in general be small ; and even if they are moderately large the terms will
be small if $\Sigma(p_{rb}\delta_r)$ and $\Sigma(p_{ra}\delta_r)$ are small, i.e. if $\delta_r$ is only slightly
correlated with $p_{ra}$ and $p_{rb}$ ; or if $p_{ra} - p_{rb}$ is small.
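The quality of the approximation (25.10) may be judged from a numerical sketch. All figures are hypothetical; `delta` holds the small changes in weights, and `dot` is merely a helper for the sums of products.

```python
def dot(x, y):
    """Sum of products of corresponding terms."""
    return sum(a * b for a, b in zip(x, y))

# Hypothetical prices in years a and b, and base-year quantities.
p_a = [12.0, 8.0, 5.0]
p_b = [10.0, 10.0, 4.0]
q_b = [20.0, 15.0, 20.0]

# Small changes in the weights between year b and year a.
delta = [0.2, -0.1, 0.1]
q_a = [q + d for q, d in zip(q_b, delta)]

# Exact product of the two base-weighted indices.
exact = (dot(p_a, q_b) / dot(p_b, q_b)) * (dot(p_b, q_a) / dot(p_a, q_a))

# First-order approximation of (25.10).
approx = 1.0 + dot(p_b, delta) / dot(p_b, q_b) - dot(p_a, delta) / dot(p_a, q_b)

print(exact, approx)
```

Here the exact product and the approximation both lie within about 0.0013 of unity and differ from one another only in the sixth decimal place.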

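Similarly for the weighted geometric index we have

$$\log(I_{ab}\,I_{ba}) = \frac{\Sigma\{w_{rb}\log(p_{ra}/p_{rb})\}}{\Sigma(w_{rb})} + \frac{\Sigma\{w_{ra}\log(p_{rb}/p_{ra})\}}{\Sigma(w_{ra})}$$

We may suppose that $\Sigma(w_{ra}) = \Sigma(w_{rb})$, for the total "weight" may
conventionally be kept constant, and thus we find, after a little reduction,
writing $w_{ra} = w_{rb} + \delta_r$,

$$\log(I_{ab}\,I_{ba}) = -\frac{\Sigma\{\delta_r\log(p_{ra}/p_{rb})\}}{\Sigma(w_{rb})}$$

This is nearly zero if the $\delta$'s are small or if $\delta$ is only slightly correlated with
the logarithm of the price changes and again the time-reversal test is
obeyed approximately.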

25.13 In order to obtain an index number which is certain to obey the
time-reversal test we may proceed as follows :

With the base year b, the index ${}_qI$ for a year a is given by

$${}_qI_{ab} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}$$

With the weights of the year a, but still with b as base, we have an index
number

$${}_qI'_{ab} = \frac{\Sigma(p_{ra}q_{ra})}{\Sigma(p_{rb}q_{ra})}$$

We then define

$$I_{ab} = \sqrt{{}_qI_{ab}\;{}_qI'_{ab}} = \sqrt{\frac{\Sigma(p_{ra}q_{rb})\,\Sigma(p_{ra}q_{ra})}{\Sigma(p_{rb}q_{rb})\,\Sigma(p_{rb}q_{ra})}} \qquad (25.11)$$


This was called by Irving Fisher the "ideal" index-number. He regarded
it as the best possible.

Examination of (25.11) will show that the time-reversal test is obeyed,
for the reciprocal of the product under the square root is the product of
${}_qI_{ba}$ and ${}_qI'_{ba}$. The principal
difficulties in using the "ideal" number are practical ones ; we rarely
have data in sufficient detail to allow us to calculate it over a series of
years.
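The time-reversal property of (25.11) may be verified numerically. In the following sketch the data are hypothetical and the function names `aggregate` and `ideal_index` are ours.

```python
import math

def aggregate(p, q):
    """Sum of price times quantity over the commodities."""
    return sum(x * y for x, y in zip(p, q))

def ideal_index(p_a, q_a, p_b, q_b):
    """Index of year a on base b as the square root in (25.11)."""
    base_weighted = aggregate(p_a, q_b) / aggregate(p_b, q_b)
    given_weighted = aggregate(p_a, q_a) / aggregate(p_b, q_a)
    return math.sqrt(base_weighted * given_weighted)

# Hypothetical prices and quantities for three commodities.
p_a, q_a = [12.0, 8.0, 5.0], [10.0, 30.0, 20.0]
p_b, q_b = [10.0, 10.0, 4.0], [20.0, 15.0, 20.0]

forward = ideal_index(p_a, q_a, p_b, q_b)    # year a on base b
backward = ideal_index(p_b, q_b, p_a, q_a)   # year b on base a

print(forward, backward, forward * backward)
```

The product of the forward and backward indices is unity apart from rounding error, whatever quantities are supplied.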


The factor-reversal test 

25.14 Irving Fisher (The Making of Index-Numbers, 1922) also proposed
what he called a factor-reversal test for price index-numbers. He argued
that if we interchange the symbols for price and quantity we should reach
an index of quantity changes which, when multiplied by the index of
price changes, should measure the change in total value. Consider, for
instance, ${}_qI_{ab}$. For the price index-number we have

$${}_qI_{ab} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}$$

Now if we interchange p and q we have an index which we may write

$${}_pI_{ab} = \frac{\Sigma(q_{ra}p_{rb})}{\Sigma(q_{rb}p_{rb})} \qquad (25.12)$$

This may be regarded as an index of quantity of the same type, weighted according
to the prices $p_{rb}$ in the basic year b. Now we have

$${}_qI_{ab}\;{}_pI_{ab} = \frac{\Sigma(p_{ra}q_{rb})\,\Sigma(p_{rb}q_{ra})}{\{\Sigma(p_{rb}q_{rb})\}^2}$$

But this is not equal to the index of total expenditure $\Sigma(p_{ra}q_{ra})/\Sigma(p_{rb}q_{rb})$
and hence the factor-reversal test is not obeyed.

25.15 Of the indices we have considered in this chapter only the "ideal"
index obeys the factor-reversal test. The reader can easily verify that
this is so from equation (25.11). This was, to Fisher, a powerful reason
in favour of the "ideal" index. It does not appear to us to carry quite
so much weight as he attributed to it. There is an element of convention
in the construction of an index of quantity such as ${}_pI$, just as in the price
index itself, and obedience to the factor-reversal test would appear to be
most required when indices of price and quantum (25.19 below) are required
together.
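The verification left to the reader may also be carried out numerically. The sketch below uses hypothetical data; the quantity index is obtained simply by interchanging the roles of prices and quantities in the same function.

```python
import math

def aggregate(p, q):
    """Sum of products over the commodities."""
    return sum(x * y for x, y in zip(p, q))

def base_weighted(x_a, x_b, w_b):
    """Index of x for year a on base b, weighted by base-year w."""
    return aggregate(x_a, w_b) / aggregate(x_b, w_b)

def ideal(x_a, w_a, x_b, w_b):
    """Geometric mean of the base-weighted and given-year-weighted indices."""
    return math.sqrt(base_weighted(x_a, x_b, w_b) * base_weighted(x_a, x_b, w_a))

# Hypothetical prices and quantities for three commodities.
p_a, q_a = [12.0, 8.0, 5.0], [10.0, 30.0, 20.0]
p_b, q_b = [10.0, 10.0, 4.0], [20.0, 15.0, 20.0]

# Change in total value between the two years.
value_ratio = aggregate(p_a, q_a) / aggregate(p_b, q_b)

# Price index times quantity index, for each type of index.
weighted_product = base_weighted(p_a, p_b, q_b) * base_weighted(q_a, q_b, p_b)
ideal_product = ideal(p_a, q_a, p_b, q_b) * ideal(q_a, p_a, q_b, p_b)

print(value_ratio, weighted_product, ideal_product)
```

Only the "ideal" product reproduces the ratio of total values; the base-weighted product differs from it by the factor 480/430 with these figures.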
The circular test

25.16 If an index is constructed for year a on base-year b, and for year
b on base-year c, we may derive an index for a on base-year c. The
so-called "circular" test requires that if we do so we ought to get the
same result as if we calculated direct an index for a on base-year c without
going through b as an intermediary. To put it another way, we require
that

$$I_{ab}\,I_{bc}\,I_{ca} = 1 \qquad (25.13)$$

which presents a kind of extension of the time-reversal test of equation
(25.9). We may note in passing that we shall not require to examine
more complicated criteria such as

$$I_{ab}\,I_{bc}\,I_{cd}\,I_{da} = 1 \qquad (25.14)$$

for such are always fulfilled if (25.9) and (25.13) are satisfied. For then

$$I_{ab}\,I_{bc} = 1/I_{ca}, \qquad I_{cd}\,I_{da} = 1/I_{ac}$$

and hence the left-hand side of (25.14) becomes $1/(I_{ca}\,I_{ac}) = 1$.

25.17 The circular test is obeyed by the simple geometric index but not by any of the other
indices we have considered. Fisher, in fact, for reasons which we do not 
regard as very cogent, argued that an index-number should not obey 
the circular test. We need not dwell on the point, since it may be shown, 
as for the time-reversal test, that the circular test is approximately obeyed 
if weights do not change very substantially over the period for which 
comparisons are being made. 
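A numerical illustration of the circular test follows. The three years of prices and quantities are hypothetical; the weighted index uses the quantities of whichever year serves as base, so that the weights change from link to link.

```python
import math

def geometric_index(p_num, p_den):
    """Simple geometric mean of the price-relatives p_num/p_den."""
    n = len(p_num)
    return math.exp(sum(math.log(x / y) for x, y in zip(p_num, p_den)) / n)

def base_weighted_index(p_num, p_den, q_base):
    """Aggregative index weighted by base-year quantities."""
    return (sum(x * q for x, q in zip(p_num, q_base))
            / sum(y * q for y, q in zip(p_den, q_base)))

# Hypothetical prices and quantities in three years a, b and c.
p = {"a": [12.0, 8.0, 5.0], "b": [10.0, 10.0, 4.0], "c": [9.0, 11.0, 4.0]}
q = {"a": [10.0, 30.0, 20.0], "b": [20.0, 15.0, 20.0], "c": [25.0, 10.0, 22.0]}

def circular_product(index):
    """I_ab * I_bc * I_ca, where index(x, y) is the index of year x on base y."""
    return index("a", "b") * index("b", "c") * index("c", "a")

geometric = circular_product(lambda x, y: geometric_index(p[x], p[y]))
weighted = circular_product(lambda x, y: base_weighted_index(p[x], p[y], q[y]))

print(geometric, weighted)
```

The geometric product is unity, while the weighted product is about 1.204 because the weights differ considerably between the three years.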

Departures from the fulfilment of the circular test are perhaps more
important in comparisons in space than in time, for then the weights are
likely to differ to a greater extent. For example, index-numbers purporting
to compare industrial production, cost-of-living or price levels
between different countries may depart from the "circular" criterion very
considerably. By the use of an appropriate set of weights it may be
possible to compare country A with country B ; but to compare either
with C new weights may be required. Hence it is quite possible to find
that the "production", for example, in A is greater than in B, and in
B is greater than in C, whereas a direct comparison can show that A's
"production" is less than C's. This inconsistency really implies that
we are trying to do too much with our index numbers. There are limits
to the amount of information we can compress into single numbers for 
comparing areas or periods in which conditions are very different. The 
most workable method of approach is probably the one we noticed in 
dealing with death-rates (14.17) where a standard set of weights is used 
for each index. 
Example 25.3. Moving weights

An interesting attempt to deal with the question of changing weights 
was made in the official index of agricultural prices introduced by the
British Ministry of Agriculture and Fisheries in 1938 (Houghton, J. Roy. 
Stat. Soc., 1938, 101, 275). Between the two world wars the pattern of 
agricultural production changed considerably in the United Kingdom 
owing to the movement from arable to grassland farming and the introduc- 
tion of some new crops such as sugar beet. Weights were calculated for 
each year for the various items entering into the index, based on the 
proportionate contribution by value to the total output. A five-yearly 
moving average was taken of these weights and the weighting factor
used for any particular year was the value of this average for the five
previous years. In an industry such as agriculture, wherein changes 
from year to year are not very large, this slow and continuous adjustment 
of weights to current conditions has much to recommend it. Com- 
parisons between years which are fairly close together can be made with 
confidence. 
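The moving-weight scheme may be sketched as follows. The yearly weights are hypothetical (two items only), and the function name `moving_weights` is ours; the sketch assumes, as the text states, that the weights for a year are the average of those of the five previous years.

```python
def moving_weights(yearly_weights, year, span=5):
    """Average, item by item, the weights of the `span` years preceding `year`."""
    previous = [yearly_weights[y] for y in range(year - span, year)]
    n_items = len(previous[0])
    return [sum(w[i] for w in previous) / span for i in range(n_items)]

# Hypothetical percentage contributions of two items to total output.
yearly_weights = {
    1933: [50.0, 50.0],
    1934: [48.0, 52.0],
    1935: [46.0, 54.0],
    1936: [44.0, 56.0],
    1937: [42.0, 58.0],
}

weights_1938 = moving_weights(yearly_weights, 1938)
print(weights_1938)
```

With these figures the weights applied in 1938 are 46 and 54, so that the steady drift in the yearly weights enters the index only gradually.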

Linking methods 

25.18 Situations sometimes arise in which we may compare each of a 
series of years with the next, but cannot so easily compare years which 
are separated in time. This is particularly so when weights are changing 
rapidly or when new commodities enter the market or disappear from it. 
In such circumstances it may be possible to construct an index for year 
2 based on year 1, for year 3 on year 2 and so on, and hence to construct 
a continuous series by linking successive years. If, for instance, the index
for year 2 on year 1 is $I_{21}$ and that for year 3 on year 2 is $I_{32}$, etc., we may,
taking year 1 as base, regard $I_{32}\,I_{21}$ as the index for year 3, $I_{43}\,I_{32}\,I_{21}$ as the index
for year 4 and so on. Comparisons for successive years are not invalidated
though those for widely separated years may be very unreliable. Index- 
numbers of this kind are sometimes useful as presenting a general picture 
of movements over a period ; but they are obviously not so firmly 
founded from the theoretical viewpoint as those we have described 
above. 

Example 25.4. Index number of shipping freight rates (Isserlis, J. Roy.
Stat. Soc., 1938, 101, 53.)

It was desired to construct an Annual Index representing the course
of Tramp Shipping Freights over the period between 1869 (when the Suez Canal
was opened) and 1936, when the calculations were carried out. From the
outset it is clear that any index of this character will require careful
interpretation, for the period concerned was one in which sea transport
was revolutionized by the change from sail to steam, and later a further
partial change to propulsion by Diesel engines. Furthermore, details of
the freights for all voyages undertaken in this period were not available,
and the actual quantities carried were also not available. In spite of the
unpromising conditions of the problem an index was constructed on the
following lines:

Quotations of the highest and lowest freights in a particular year were 
available over the period concerned and the mid-point between the two 
was taken as representing the average freight for the year. This is a crude 
form of average necessitated by the paucity of the data, but is probably 
reasonably accurate except in years such as 1915 when freights trebled 
as compared with the year before owing to the circumstances of World 
War I.

These average freights were available for 210 homeward routes to the
U.K. and 112 outward routes, but owing to the varying nature of the
tramp trade over the period, quotations were not available in respect 
of each route for each year. Consequently for any particular year there 
were a number of missing quotations. For each route where quotations 
in consecutive years were available a price-relative was constructed 
based on the previous year; for example, for the route Java/U.K. in
Sugar the freight in 1870 was 93 per cent of that in 1869 and the price- 
relative was therefore 93. In 1919 the freight was 31 per cent of that for 
1918, and the price-relative was therefore 31. 

For each year the available price-relatives were averaged arithmetically 
over homeward and outward routes to give an average price-relative for 
that year as compared with the previous year. For example, the average 
price relative for 1936 was 117.3.

An index over the 68 years concerned was then constructed on the
basis of a chain method. The average price-relative for 1870 was 103,
and on the basis of 1869=100 the freight index was also 103. The average
price-relative for 1871 was 99 and the index was therefore taken as
(99 x 103)/100, namely 102. Similarly, by this chain method, the index
was built up from one year to the next. The freight index for 1935
was 88. The price relative for 1936, as noted above, was 117.3 and
therefore the index for 1936 was (88 x 117.3)/100, namely 103.
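The chain arithmetic of this example is easily reproduced. In the sketch below the function name `chain_index` is ours; the figures 103, 99, 88 and 117.3 are those quoted in the text, and the intermediate years are not given there, so the step to 1936 is taken directly from the 1935 index.

```python
def chain_index(link_relatives, base=100.0):
    """Cumulate year-on-previous-year relatives (percentages) into a chain index."""
    index = base
    series = []
    for relative in link_relatives:
        index = index * relative / 100.0
        series.append(index)
    return series

# Average price-relatives for 1870 and 1871, on the basis 1869 = 100.
early = chain_index([103.0, 99.0])

# One further link: 1936 on the 1935 index of 88.
index_1936 = chain_index([117.3], base=88.0)[0]

print([round(x) for x in early], round(index_1936))
```

The unrounded values are 103, 101.97 and 103.224, which round to the figures 103, 102 and 103 given above.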

For the purpose of giving a general view over the period, the index 
is perhaps not unsatisfactory. Although the rates on individual routes 
cannot be weighted by reference to the quantity of traffic the large 
number of routes employed ensures some degree of weighting in the index 
as a whole according to volume of traffic ; and although a comparison 
between neighbouring years is more reliable than one between two years 
which are widely separated in time, it is between the closer years that 
comparisons most frequently fall to be made. 

In 1935 there became available detailed information of tramp voyages
undertaken in U.K. ships in that year. It was then possible to construct
an index of tramp shipping freights weighted according to gross freights 
earned on cargoes carried in that year. The agreement of this index 
with the chain index was fairly good. 
Quantum indices

25.19 Reduction of the data to homogeneity is a pre-requisite of all 
index numbers of the type we have considered in this chapter, and we have 
already noticed that in many instances the only available common unit 
is money value. Unfortunately, this is precisely the unit which does 
not remain constant over periods of time. If we measure the industrial
production of a country by the value of the gross output of its manu- 
facturing plant and find that the value in 1948 was twice as great as in 
1938 we obviously gain a very poor idea of the change in output over the 
period in any real sense. Our prices have changed in the meantime. 
Can we then measure in any reasonable way what is the change in output
apart from changes in prices ? Can we obtain some index of production 
which is related to physical output and is free from changes in prices or 
money values ? 


25.20 Suppose that in the basic period the value of the output of a
commodity is typified by $v_{rs}$ and the price of some unit by $p_{rs}$. If the
price in the jth year is $p_{rj}$, and the output is valued at $v_{rj}$, then the quantity
$v_{rj}\,p_{rs}/p_{rj}$ is what the output would have been valued at if the price had
been that of the basic year. We may then construct the index-number

$${}_pI_j = \frac{\Sigma(v_{rj}\,p_{rs}/p_{rj})}{\Sigma(v_{rs})} \qquad (25.15)$$

This is the ratio of the value of the output in the jth year, revalued at
"basic" prices, to the value of the output in the basic year. It evidently
goes a long way to meet our requirements. It bears a kind of inverted
relation to the index of equation (25.4). If there exist quantities q such
that $v_{rj} = p_{rj}\,q_{rj}$ we have, on substitution for v in (25.15)

$${}_pI_j = \frac{\Sigma(p_{rs}\,q_{rj})}{\Sigma(p_{rs}\,q_{rs})} \qquad (25.16)$$

which exhibits ${}_pI$ as an average of quantities q weighted by prices in the
basic period, a similar index to that of equation (25.12). As noted in
25.15, the factor-reversal test requires that our indices of price and output
shall, when multiplied together, measure the change in value of total
output (a very reasonable requirement when both indices are used but
not necessarily a desideratum when only one of them is to be calculated).


25.21 It is of some importance to note that we can calculate ${}_pI_j$ from
(25.15) even when quantities q do not exist. Suppose, for example, we
are constructing an index of the price of travel in London, into which
there enter expenditures on buses, trams, electric trains and taxis ; there
is no "quantity" of travel, though perhaps we might construct measures
on a mileage basis. This, however, is unimportant if we know the
expenditures v and the ratios $p_{rj}/p_{rs}$; if, for instance, we know that in the
jth year prices of fares on buses, trams and trains are 10 per cent greater
than in the basic year, whereas taxi-fares have remained unchanged. So
long as the price-relatives are known, the expenditures $v_{rj}$ and $v_{rs}$ are
sufficient for the computation of ${}_pI$ without the intermediate calculation
of notional quantities q.

Index-numbers such as ${}_pI$ are best known as quantum indices.
Expressions such as "index of volume" occur but are misleading as the
following example shows.
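The computation from expenditures and price-relatives alone may be sketched as follows. The price changes (10 per cent on buses, trams and trains, none on taxis) are those supposed in the text; the expenditures are hypothetical, and the function name `quantum_index` is ours.

```python
def quantum_index(value_now, value_base, price_relative):
    """Quantum index (base = 100) on the pattern of (25.15).

    value_now      -- expenditures in the given year
    value_base     -- expenditures in the basic year
    price_relative -- price in the given year divided by price in the basic year
    """
    # Revalue each current expenditure at basic-year prices.
    revalued = sum(v / k for v, k in zip(value_now, price_relative))
    return 100.0 * revalued / sum(value_base)

# Hypothetical expenditures on buses, trams, trains and taxis.
value_base = [40.0, 20.0, 30.0, 10.0]
value_now = [55.0, 22.0, 33.0, 12.0]
price_relative = [1.10, 1.10, 1.10, 1.00]

index = quantum_index(value_now, value_base, price_relative)
print(round(index, 1))
```

Total expenditure in this illustration rises by 22 per cent, but after revaluation at basic prices the quantum index stands at 112. No quantities of travel enter the calculation.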


Example 25.5. The British Board of Trade publishes an index-number
of the "volume" of imports and exports. This is obtained by revaluing
imports or exports in the given period on the basis of 1938 prices and
expressing the results as percentages of the 1938 values. The following
are the figures for 1946 and 1949 (1938 = 100):

                              1946    1949
Imports                         67      84
Exports (including coal)        99     151
Exports (excluding coal)       107     161


Now it so happens in this case that we can estimate the actual weights 
(in tons of cargo) covered by these import and export figures. For exports 
(excluding coal) it is estimated that the figures were, in 1946, 98 per cent 
of 1938 and, in 1949, 120 per cent. Thus, where the quantum index gives 
161, the index based on actual weight in tons is only 120. Clearly the 
quantum index does not measure "volume" in any ordinary sense
associated with physical size or weight alone (it may be regarded as an 
index of this weight weighted by prices). On the other hand, it may be 
the correct index to use when attention is being directed to the relative 
contribution of exports to the balance of trade, price changes being 
eliminated from the comparison with the basic year. 

25.22 In conclusion, we may intimate, without being able to pursue the 
subject, that for certain classes of statistical work it appears to be possible 
to develop a theory of index-numbers of a rather different kind from that 
discussed in this chapter. Psychologists have for some time studied 
techniques for isolating "general factors" from a complex of tests of
ability which are capable of application to the isolation of a "general price
level" from a complex of price movements. Biometricians, from a
different point of view, have considered the problem of forming linear
functions of observations which will most closely, in some reasonable sense,
summarise the essential properties of classes, the so-called "discriminant
functions". Something has already been done in applying such methods
to the formation of index-numbers. The subject has, however, hardly 
reached the point of practical application in economics, and it is unlikely 
that the methods described in this chapter will be supplanted for general 
use. 
SUMMARY

1. The price-relative of a commodity for a particular period is the 
ratio of its price in that period to the price in a basic period. It is usually 
multiplied by 100 for convenience of expression as a percentage. 

2. There is an element of convention in the definition of a price
index-number. Simple unweighted numbers are

$${}_aI_j = \frac{1}{n}\,\Sigma(p_{rj}/p_{rs})$$

$${}_gI_j = \{\Pi(p_{rj}/p_{rs})\}^{1/n}$$

3. Weighted index-numbers in common use are

$${}_qI_j = \Sigma(p_{rj}\,q_{rs}) \,/\, \Sigma(p_{rs}\,q_{rs})$$

$$\log {}_wI_j = \Sigma\{w_r\log(p_{rj}/p_{rs})\} \,/\, \Sigma(w_r)$$

4. The time-reversal test requires that

$$I_{ab}\,I_{ba} = 1$$

This is obeyed by the simple geometric index but not by the weighted indices, though the latter
may obey it approximately.

5. The time-reversal test is obeyed by the "ideal" index-number

$$\sqrt{\frac{\Sigma(p_{ra}q_{rb})\,\Sigma(p_{ra}q_{ra})}{\Sigma(p_{rb}q_{rb})\,\Sigma(p_{rb}q_{ra})}}$$

This also obeys a factor-reversal test.


6. The circular test requires that

$$I_{ab}\,I_{bc}\,I_{ca} = 1$$

It is not obeyed by any of the weighted indices unless the weights are
constant, but may be obeyed approximately.

7. Linking methods may give a suitable chain index when data are
available to make comparisons possible for adjacent years.

8. Quantum index-numbers purport to measure a "quantity"
independently of price change. The principal form in common use is

$${}_pI_j = \Sigma(v_{rj}\,p_{rs}/p_{rj}) \,/\, \Sigma(v_{rs})$$

Quantum does not necessarily measure physical volume or weight.
EXERCISES

25.1 The following figures show the wholesale prices of refined petroleum
per gallon in the U.K. for the years specified. On the basis of 1923=100
construct a series of price-relatives.

Year    Price per gallon
        (pence)
1923    13
1924    13*
1925    13*
1926    13
1927    13
1928    11*
1929    12*
1930    12*
1931    11*
1932    10*
1933    10*
1934    10f*
1935    10*

25.2 Show that the simple geometric index-number possesses the "chain-property" of
25.18, namely that the index for a year j on base 1 is the product of
corresponding indices of j on j-1, j-1 on j-2, ..., 2 on 1.

25.3 The following figures show for U.K. total imports (a) the declared
value and (b) the value on the basis of average values in 1930. Taking
1930 as a base year construct index-numbers (1) of average values and
(2) of quantum for the years 1931-6.

Year    Declared Value    Value on 1930 basis
        £ million         £ million
1930    1,044             1,044
1931      861             1,067
1932      702               939
1933      675               946
1934      731               991
1935      756             1,012
1936      848             1,077



25.4 Using the weights of Example 25.1 calculate the index for all
articles if the indices for the constituent groups are as follows : Food
95 ; Rent and rates 90 ; Clothing 110 ; Fuel and light 120 ; Household
goods 102 ; Miscellaneous 115 ; Services 98 ; Drink and Tobacco 108.

Examine the effect of rounding up the weights (a) to the nearest 10
(b) to the nearest 100.
25.5 In the notation of 25.14 consider the index-number

$$I = \tfrac{1}{2}\,({}_qI_{ab} + {}_qI'_{ab})$$

Show that if the weights in the years a and b differ by a small amount $\delta_r$,
the difference between this index and the "ideal" index is zero to the
first order in $\delta_r$.

25.6 The following figures give the annual average prices in the U.K.
for beef, mutton and pork.

Year    Beef (prime)    Mutton (prime)    Pork
              (pence per 8 lb.)
1935        54               75             62
1936        54               73             65
1937        61               78             68
1938        62               62             69
1939        61               68             70
1940        72               85             96
1941        72               85             96
1942        76               90            101
1943        79               96            102

Construct an index of "meat prices" for the period (a) of type ${}_aI$, (b) of
type ${}_qI$ with weights 4, 2 and 1 for beef, mutton and pork respectively.
Take 1935=100 in each case.

25.7 Show that the index-number

$$I_{ab} = \frac{\Sigma\{p_{ra}(q_{ra}+q_{rb})\}}{\Sigma\{p_{rb}(q_{ra}+q_{rb})\}}$$

obeys the time-reversal test but not the circular test unless the weights
in the three years a, b, c are equal.
Introduction

26.1 When we observe numerical features of an individual or a population 
at different points of time, the set of observations constitutes a time-series. 
The temperature at a given place over a given period, the population of 
a country over a number of years, the imports of a country for a series 
of months, the weight of an animal recorded at various stages of growth, 
are familiar examples of the kind of phenomena which provide series of 
values at a succession of points of time. The statistical data which 
they furnish differ from most of the data which we have discussed hitherto 
in that we are interested, not merely in the aggregate of values, but in the 
order in which they occur. 

26.2 Throughout this and the succeeding chapter we shall consider only
series of values given at equidistant intervals of time. By taking the
time-interval as unit we can then regard our series as defined at times
t = 1, 2, 3, etc., and can write the values of the series as $u_1$, $u_2$, $u_3$, etc.,
the value at time t being $u_t$. If for any reason we wish to reckon time
backwards as well as forwards from time t = 0 we can write the series as
$u_{-2}$, $u_{-1}$, $u_0$, $u_1$, $u_2$, etc.

The restriction as to equidistant intervals is not in practice a serious 
limitation. Most series which are available in official publications such 
as economic, demographic, and meteorological series, are in fact given 
at intervals which are exactly equal, as days, or approximately equal, as 
years, or more or less roughly equal, as calendar months. Experimental 
data are usually collected at equidistant intervals as a matter of routine 
or are recorded (as on barometric graphs) in a continuous form from which 
equidistant readings may be taken. Our discussion of theoretical questions 
is greatly simplified by assuming equidistance in the time-intervals. 

26.3 Although we shall draw all our illustrations from time-series it 
should be pointed out that the theory is also capable of application to
certain other types of statistical data. For instance, if we put a thread
of cotton under the microscope it presents, as we proceed along the thread,
a fluctuating profile which bears at least a superficial resemblance to an
oscillating time-series ; and we can regard the nitrogen content at various 
points along a strip of soil as the values of a series in which the time 
variable is replaced by a space variable. In fact our methods are
applicable, and are often appropriate, whenever we have a statistical
variable depending on a variable t, whether relating to time or to linear
space.

26.4 In general the variable u may be discontinuous or continuous, 
univariate or multivariate. For example, numbers of human beings are 
necessarily integral and population-series are therefore discontinuous in 
the variate ; on the other hand rainfall and temperature are continuous. 
Again, we may wish to study the movement through time of one variate, 
such as the price of wheat, or of several, such as wages, employment and 
volume of industrial output. In the latter case it is usually more con- 
venient to regard each variate as yielding a separate (univariate) series 
and to study the relations between variates as the joint variation of 
several series. 

26.5 Although our time-values are discontinuous, we must remember 
that the series itself, of which they form equidistantly spaced observa- 
tions, may be either continuous or discontinuous in time. Some series 
are necessarily discontinuous. For example, the final dividends on an 
industrial security are declared once a year, usually but not always on 
about the same date, and there are no variate values between those dates. 
Again, although the act of earning an income may be carried on almost 
continuously, the remuneration received is usually paid once a week, 
once a month or once a quarter, namely at discontinuous intervals. Some 
series are continuous and may be continuously recorded, as for instance, 
by the instruments which graph on a rotating drum the temperature and 
barometric pressure in a particular locality. Between the extremes of 
unambiguous discontinuity and continuity we find numerous cases of a 
hybrid character. The price of a loaf of bread may be regarded as 
existing continuously while shops are open and even perhaps while they 
are shut ; the price of an industrial share can hardly be regarded as 
existing while the Stock Exchange is closed, and when it is open really 
varies discontinuously in the sense that on an active market the price
may change with each transaction and hence is only determined at 
particular moments during the day. Certain quantities such as annual 
income or monthly rainfall are discontinuous in the time-variable in so 
far as there is only one value for the year or month as the case may be, 
but continuous in the sense that they are an accumulation over a con- 
tinuous period of time. Such distinctions will not often cause us difficulty 
but they provide one more illustration of the maxim, of which perhaps 
the reader may be growing a little weary by this stage, that one should 
never forget the nature of one’s primary material. 

Some examples of time-series 

26.6 We now give a few illustrations of the kind of material which we
have to study in practice. Some examples have occurred earlier in this 
book. Table 15.6 (Fig. 15.6) on page 359, showing the population of
England and Wales at ten-yearly intervals, gives a typical series for the 
growth of a large aggregate of human beings. The series is smooth in the 
sense that the values lie closely about a continuous curve. On the other 
hand, the infantile and general mortality rates of England and Wales 
graphed in Figure 13.1 on page 318, though moving downwards over the 
period covered by the diagram, do not decline regularly. Table 26.1 
and Figure 26.1, showing the sheep population of England and Wales 
for certain years, give a picture of a somewhat similar kind, but the 
departures from a smooth movement are of longer duration, and it is not 
easy to decide from these data whether the increases following the low 
point in 1922 are a reversal of the downward movement or only a tem- 
porary fluctuation. 


TABLE 26.1. Sheep population of England and Wales for each year from 1867 to 1939

Data from the Agricultural Statistics

Year  Population    Year  Population    Year  Population    Year  Population
       (10,000)            (10,000)            (10,000)            (10,000)
1867     2203       1886     1892       1905     1823       1924     1484
1868     2360       1887     1919       1906     1843       1925     1597
1869     2254       1888     1853       1907     1880       1926     1686
1870     2165       1889     1868       1908     1968       1927     1707
1871     2024       1890     1991       1909     2029       1928     1640
1872     2078       1891     2111       1910     1996       1929     1611
1873     2214       1892     2119       1911     1933       1930     1632
1874     2292       1893     1991       1912     1805       1931     1775
1875     2207       1894     1859       1913     1713       1932     1850
1876     2119       1895     1856       1914     1726       1933     1809
1877     2119       1896     1924       1915     1752       1934     1653
1878     2137       1897     1892       1916     1795       1935     1648
1879     2132       1898     1916       1917     1717       1936     1665
1880     1955       1899     1968       1918     1648       1937     1627
1881     1785       1900     1928       1919     1512       1938     1791
1882     1747       1901     1898       1920     1338       1939     1797
1883     1818       1902     1850       1921     1383
1884     1909       1903     1841       1922     1344
1885     1958       1904     1824       1923     1384



26.7 The two last examples exhibit not only local variation but a broad
movement over the period, a trend as we may call it. In our next three
examples there is no apparent trend but varying degrees of "short-term"
or "local" variation. Table 26.2 and Figure 26.2 show the percentage
losses of British ships per annum (i.e. 100 times the tonnage lost divided
by the tonnage at risk). There is a good deal of variation from year to
year but it is not very regular, at least so far as the eye can judge. In
Table 26.3 (Figure 26.3) showing the crude birth-rates of cattle in Great
Britain on a quarterly basis there is, in contrast, a marked regularity due
to the seasonal character of births of cattle. There may have
been seasonal effects in the data of Table 26.2, but if so they have been
obliterated by the use of annual figures. Table 26.4 and Figure 26.4
show a rhythm in numbers of sunspots which is not seasonal. It is not
so regular as that of Table 26.3 but there is evidently some degree of
regularity present.
Fig. 26.2. Graph of the data of Table 26.2

26.8 Examples such as these lead us to regard a time-series as composed 
of three constituent items, a long-term movement or trend, a short-term 
systematic movement and an unsystematic or random component. Some 
series, of course, do not exhibit all three — the movement shown in Figure 
15.6 is nearly all trend, that of Fig 26.3 is nearly all systematic oscillation, 
and that of Figure 26.2 seems on the face of it to contain a good deal of 
random fluctuation. One of our principal problems is to isolate these 
components for separate study. 
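The three-part view of a series can be illustrated with a small sketch. It assumes the components simply add together; the function name, the linear trend, the sine oscillation, the Gaussian random element and all parameter values are illustrative assumptions and are not taken from the text.

```python
import math
import random


def synthetic_series(n, slope=0.5, period=4, amplitude=5.0, noise_sd=1.0, seed=0):
    """Build u_t = trend + oscillation + random element, returned with its parts.

    Every parameter value here is an illustrative assumption.
    """
    rng = random.Random(seed)
    # Long-term movement: a straight line in t.
    trend = [slope * t for t in range(n)]
    # Short-term systematic movement: a strictly periodic oscillation.
    oscillation = [amplitude * math.sin(2 * math.pi * t / period) for t in range(n)]
    # Unsystematic component: independent random terms with zero mean.
    noise = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    series = [a + b + c for a, b, c in zip(trend, oscillation, noise)]
    return series, trend, oscillation, noise


series, trend, oscillation, noise = synthetic_series(24)
```

In practice only `series` is observed; the methods of this chapter try to recover something like the other three lists from it.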

TABLE 26.3. Crude birth rates (number of births per 100 population) of cattle in
Great Britain

Data from Joan Marley, J. Roy. Stat. Soc., 110, 187

The figures have been multiplied by a factor of approximately four to make them
comparable with annual rates

                           Birth rate
Year     December-    March-    June-     September-
         February     May       August    November
1940     33.2         45.2      33.2      30.0
1941     35.2         44.0      38.8      32.8
1942     35.2         46.4      35.6      34.4
1943     34.8         44.8      32.0      38.4
1944     37.6         41.2      32.8      36.8
1945     36.0         42.0      30.0      35.2

Fig. 26.3. Graph of the data of Table 26.3


TABLE 26.4. Wolfer's sunspot numbers for the years 1853-1900

Quoted by G. Udny Yule, Phil. Trans. A, 226, 267

Year   Number     Year   Number     Year   Number     Year   Number
1853    39.0      1865    30.5      1877    12.3      1889     6.3
1854              1866    16.3      1878     3.4      1890     7.1
1855     6.7      1867     7.3      1879     6.0      1891    35.6
1856     4.3      1868    37.3      1880    32.3      1892    73.0
1857    22.8      1869    73.9      1881    54.3      1893    84.9
1858    54.8      1870   139.1      1882    59.7      1894    78.0
1859    93.8      1871   111.2      1883    63.7      1895    64.0
1860    95.7      1872   101.7      1884    63.5      1896    41.8
1861    77.2      1873    66.3      1885    52.2      1897    26.2
1862    59.1      1874    44.7      1886    25.4      1898    26.7
1863    44.0      1875    17.1      1887    13.1      1899    12.1
1864    47.0      1876    11.3      1888     6.8      1900     9.5


26.9 One initial word of warning is necessary. It is useful to isolate
the components of a series for sundry purposes. We may, for instance,
be interested in the broad movement of a series and hence concentrate
attention on the trend to the exclusion of local and casual variation. But
this does not necessarily mean that we can in a parallel manner isolate
the causal systems underlying these movements. As a pure matter of
description we may ignore local variations and consider the trend; but we
must not mislead ourselves by supposing that there is some fundamental
cause or set of causes which generates the trend movement and another
distinct set which accounts for the local movements. This is sometimes
so, but not always so. We shall give later in the chapter (Example 26.4)
an example of an artificial series which reproduces most of the features
of the so-called trade cycle, namely a series of long swings on which are
superposed more erratic short-term movements, but which is not composed
of a trend-generator and a short-term generator.



Fig. 26.4. Graph of the data of Table 26.4

26.10 We may also remark at this stage that the distinction between a 
long-term and a short-term movement is to some extent arbitrary. The 
so-called trade-cycle is a long movement for most business purposes, the 
depressions and peaks occurring about once every ten years on the average. 
But in considering the recurrence of ice-ages or the growth and decay 
of civilisations, ten years would be a very short time. What we call a
trend in any particular case is a matter of choice. It would be more
accurate to speak of long-term or short-term movements and even then
it is a convention what length of time we regard as long or short. 

Trend

26.11 The general notion of trend as a broad continuous motion of the
system leads us to consider the possibility of representing it by a
polynomial in the time-variable t. The representation of a set of values
$u_1, \dots, u_n$ by a parabola of the form

$$u_t = a_0 + a_1 t + a_2 t^2 + \dots + a_p t^p \tag{26.1}$$

has already been considered in Chapter 15 and we need add little to what
was said there on the subject. In Example 15.5 we did, in fact, fit a
cubic parabola to the population data of Table 15.6 and obtained a very
fair fit.

26.12 This method of trend determination has some serious drawbacks
when, as in Table 26.1, the polynomial required to obtain a good fit is
of high order. The arithmetic becomes troublesome; the higher order
terms of the polynomial tend, as we pointed out in 15.22, to wag the
tail of the curve; and if at some stage we add further terms to the
series (as frequently happens when new data arise by the passage of time)
the work of fitting has to begin afresh. The object of polynomial fitting
can be attained by a simpler process known as the method of moving
averages.

Moving averages 

26.13 Consider the first 2m+1 terms of the series, where m is a number
which we can choose at will. We may fit a polynomial of order p to these
terms and by convention will take our origin at the (m+1)th term, i.e.,
the middle one. Our polynomial, fitted by the usual method of least
squares given in Chapter 15, will then be of the type (26.1) and we may
determine the constants by such equations as

$$\Sigma(t^j u_t) - a_0\Sigma(t^j) - a_1\Sigma(t^{j+1}) - \dots - a_p\Sigma(t^{j+p}) = 0 \tag{26.2}$$

there being (p+1) of these equations corresponding to values of j from
0 to p, and the summations extending over the values of t from -m
to +m. (Compare equations (15.8) on page 344.)

This polynomial is the best fit, in a least squares sense, to the first
(2m+1) terms of the series and we may therefore take it as determining
the trend value at the origin, that is to say at the (m+1)th point. The
trend value is then obtained by putting t=0 in (26.1) and reduces simply to
$a_0$. We need therefore only determine $a_0$ from the equations (26.2).
The other constants a are not required.

It should be noted that the sums occurring in (26.2) are simply sums
of the integers or their powers from -m to m and hence depend
only on m and p, not on the values of u except in the case of the first
term $\Sigma(t^j u_t)$. It then follows that when we solve the equations for
$a_0$ we shall obtain a linear expression in the values of u of the type

$$a_0 = b_1 u_1 + b_2 u_2 + \dots + b_{2m+1} u_{2m+1} \tag{26.3}$$

where the b's depend only on m and p. This expression is a
weighted average of the first (2m+1) values of the series, the weights
being determinate once we have fixed m and p.

We may now repeat the process by moving along the series and fitting
a curve to the (2m+1) points from $u_2$ to $u_{2m+2}$, determining a trend
value corresponding to the point $u_{m+2}$ (the middle one of this set); and
since our treatment remains the same except for changes in the values
of the u's the trend value will be given by

$$a_0 = b_1 u_2 + b_2 u_3 + \dots + b_{2m+1} u_{2m+2}$$

where the b's are the same quantities as were reached in equation (26.3).
We then proceed one step further along the series and repeat the process;
and so on.
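As a rough numerical sketch of this argument, the weights b can be obtained by fitting the polynomial by least squares and reading off the row of the solution that gives $a_0$. The function name `moving_average_weights` is an illustrative assumption, and the pseudo-inverse is used here merely as a convenient way of solving the normal equations (26.2); it is not the method of calculation described in the text.

```python
import numpy as np


def moving_average_weights(m, p):
    """Weights b with a_0 = sum(b * u) for a degree-p least-squares fit
    to 2m+1 equally spaced points, the origin being the middle point."""
    t = np.arange(-m, m + 1)
    # Design matrix with columns 1, t, t^2, ..., t^p.
    X = np.vander(t, p + 1, increasing=True).astype(float)
    # The pseudo-inverse maps the observed values u to the fitted constants
    # a_0, ..., a_p; its first row therefore holds the weights for a_0.
    return np.linalg.pinv(X)[0]


weights = moving_average_weights(2, 2)
print(np.round(weights * 35, 6))
```

The weights depend only on m and p, so they are computed once and then slid along the series, which is the point of the method.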


26.14 The net result of this treatment is that once we have determined
the constants b we can ascertain trend values by a weighted average of
sets of (2m+1) consecutive terms. We take, in fact, a moving average
along the series. There will be no values corresponding to the first m
or the last m terms of the series and we must either resign ourselves to
having no trend for these 2m terms or adopt special measures to obtain
them. Our trend values will "smooth" the series in the sense that they
correspond to values of best fit given by polynomials of local application.
The process of trend determination is often described as "smoothing."


26.15 Let us consider the simplest case when we fit straight lines to
sets of three points (m=1, p=1). Our polynomial is then simply
$a_0 + a_1 t$ and we have to minimise the sum of squares

$$\sum_{t=-1}^{1} (u_t - a_0 - a_1 t)^2$$

which leads to the equations

$$\begin{aligned}
\Sigma(u_t) - 3a_0 - a_1\Sigma(t) &= 0 \\
\Sigma(t u_t) - a_0\Sigma(t) - a_1\Sigma(t^2) &= 0
\end{aligned} \tag{26.4}$$

Now $\Sigma(t)=0$ and in general $\Sigma(t^p)=0$ whenever p is odd. We then have
simply from the first equation of (26.4)

$$a_0 = \tfrac{1}{3}\Sigma(u_t) = \tfrac{1}{3}(u_{-1} + u_0 + u_1) \tag{26.5}$$

In short, our trend value at any point is simply the arithmetic mean of the
three values of u centred at that point.

26.16 Consider next the case when we fit straight lines to sets of
2m+1 points. Corresponding to the first equation of (26.4) we shall have

$$\Sigma(u_t) - (2m+1)a_0 = 0$$

leading to

$$a_0 = \frac{1}{2m+1}(u_{-m} + \dots + u_0 + \dots + u_m) \tag{26.6}$$

In simple generalisation of the previous case we then have the result
that the trend value at any point is the arithmetic mean of the 2m+1
values centred at that point.

26.17 The next case in order of complexity is the fitting of a quadratic
parabola to sets of 5 points (p=2, m=2). We then have to minimise

$$\sum_{t=-2}^{2} (u_t - a_0 - a_1 t - a_2 t^2)^2$$

and remembering that $\Sigma(t^p)=0$ for odd p we arrive at the equations

$$\begin{aligned}
\Sigma(u_t) - 5a_0 - a_2\Sigma(t^2) &= 0 \\
\Sigma(t u_t) - a_1\Sigma(t^2) &= 0 \\
\Sigma(t^2 u_t) - a_0\Sigma(t^2) - a_2\Sigma(t^4) &= 0
\end{aligned} \tag{26.7}$$

Now $\Sigma(t^2)=10$ and $\Sigma(t^4)=34$. The relevant equations are then

$$\begin{aligned}
\Sigma(u_t) - 5a_0 - 10a_2 &= 0 \\
\Sigma(t^2 u_t) - 10a_0 - 34a_2 &= 0
\end{aligned}$$

leading to

$$a_0 = \tfrac{1}{35}\{-3u_{-2} + 12u_{-1} + 17u_0 + 12u_1 - 3u_2\} \tag{26.8}$$

26.18 Proceeding in this way, we can determine the weights appropriate
to any system of m and p. The values of the weights for the cases required
in practice, however, have been worked out and the simpler ones are
given below. Let us note two properties of any system of weighting
given by this method:

(a) The sum of the weights is unity. This follows from the fact that
the sum in such an equation as (26.8) is obtained by putting all the u's
equal to unity. If we do this in the first equation of (26.7) and equate
all the other a's of even order to zero (as we may, since in this case a
straight line gives a perfect fit) we see that $a_0=1$.

(b) The weights are symmetrical about their middle value. This follows
from the fact that we must obtain the same result if we start from the end
of the series and work backwards.

We can then write a series of weights such as those of equation (26.8)
in the form $\frac{1}{35}[-3, 12, 17, \dots]$. Those of (26.5) would similarly be
written $\frac{1}{3}[1, 1, \dots]$. With this notation we can now write down,
without proof, the weights for the simpler cases.

p = 1 (straight line):

$$\frac{1}{2m+1}[1, 1, \dots, 1] \tag{26.9}$$

p = 2 or 3 (quadratic or cubic):

$$\begin{aligned}
m=2:&\quad \tfrac{1}{35}[-3, 12, 17, \dots] \\
m=3:&\quad \tfrac{1}{21}[-2, 3, 6, 7, \dots] \\
m=4:&\quad \tfrac{1}{231}[-21, 14, 39, 54, 59, \dots] \\
m=5:&\quad \tfrac{1}{429}[-36, 9, 44, 69, 84, 89, \dots]
\end{aligned} \tag{26.10}$$

p = 4 or 5 (quartic or quintic):

$$\begin{aligned}
m=3:&\quad \tfrac{1}{231}[5, -30, 75, 131, \dots] \\
m=4:&\quad \tfrac{1}{429}[15, -55, 30, 135, 179, \dots] \\
m=5:&\quad \tfrac{1}{429}[18, -45, -10, 60, 120, 143, \dots]
\end{aligned} \tag{26.11}$$

The reader will note that the same formulae are obtained for p=2k+1 as
for p=2k. We leave it as an exercise for him to examine why this is so.
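The tabulated weights and the two properties noted in 26.18 can be checked in exact arithmetic. The sketch below solves the normal equations (26.2) with rational numbers by Gauss-Jordan elimination; the function name `exact_weights` and the choice of elimination are illustrative assumptions, not the procedure by which the published weights were worked out.

```python
from fractions import Fraction


def exact_weights(m, p):
    """Exact weights for a_0, as Fractions, from the normal equations (26.2)."""
    ts = range(-m, m + 1)
    n = p + 1
    # Left-hand sides: entry (j, k) is the sum of t^(j+k) over t = -m..m.
    A = [[Fraction(sum(t ** (j + k) for t in ts)) for k in range(n)] for j in range(n)]
    # Right-hand sides: one column per u_t, whose coefficient in row j is t^j.
    B = [[Fraction(t ** j) for t in ts] for j in range(n)]
    # Gauss-Jordan elimination applied to A and B together.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        B[col], B[pivot] = B[pivot], B[col]
        inv = 1 / A[col][col]
        A[col] = [x * inv for x in A[col]]
        B[col] = [x * inv for x in B[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                B[r] = [x - f * y for x, y in zip(B[r], B[col])]
    # Row 0 now expresses a_0 as a weighted sum of the u's.
    return B[0]


quadratic_7 = exact_weights(3, 2)
cubic_7 = exact_weights(3, 3)
```

Here `quadratic_7` and `cubic_7` come out identical, their sum is exactly 1, and the list reads the same backwards as forwards.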

26.19 It is evident that expressions such as this rapidly become rather
cumbersome. We shall consider below how they may be simplified by
approximation, but before doing so will give a numerical example.

Example 26.1

To fit a trend line by moving averages to the sheep population data
of Table 26.1.

Let us first take a simple average of the type (26.9). We have to decide
on the extent of the average, namely the number m. Our process will
be sufficiently clear if we fit a curve to the first forty terms of the series
only.

There is, at this stage, no golden rule which can be laid down for the
determination of the extent of the average. We can only try a few values
and see if they give us the kind of trend line we want. Let us then take
two values, m=2 and m=4 (corresponding to extents of 5 and 9 terms
respectively).

For the moving average of 5 we have to sum consecutive sets of five
terms and divide by 5. The process is illustrated in Table 26.5. It is
very simply carried out because in moving on a step we have only to add
on one term to the sum of five at the end and take off one at the beginning.
A similar process gives us the moving average of nine terms. Figure
26.5 shows the result of fitting the two trend lines.

TABLE 26.5. Illustration of the arithmetic of fitting a simple moving average of fives
to the data of Table 26.1

(1)          (2)         (3)                  (4)           (5)
Number of    Value of    Sum of consecutive   1/5 of        Deviation,
term, t      series      sets of five         column (3)    column (2) less
             u_t         values of u_t                      column (4)

 1           2203
 2           2360
 3           2254        11006                2201            53
 4           2165        10881                2176           -11
 5           2024        10735                2147          -123
 6           2078        10773                2155           -77
 7           2214        10815                2163            51
 8           2292        10910                2182           110
 9           2207        . . .                . . .         . . .
10           2119        . . .                . . .         . . .
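The running-sum arithmetic of Table 26.5 can be sketched as follows. The function name `simple_moving_average` is an illustrative assumption; the ten values are those printed in column (2) of the table.

```python
def simple_moving_average(values, k):
    """Centred simple moving average of odd extent k, using a running sum."""
    if k % 2 == 0 or k > len(values):
        raise ValueError("k must be odd and no longer than the series")
    total = sum(values[:k])
    averages = [total / k]
    for i in range(k, len(values)):
        # Add on the new term at the end and take off the one at the beginning.
        total += values[i] - values[i - k]
        averages.append(total / k)
    # averages[j] is the trend value for the term at 0-based position j + (k - 1) // 2.
    return averages


# First ten terms of the series, from column (2) of Table 26.5.
u = [2203, 2360, 2254, 2165, 2024, 2078, 2214, 2292, 2207, 2119]
trend = simple_moving_average(u, 5)
# Deviations of column (5): actual value less trend, for terms 3 to 8.
deviations = [round(u[j + 2] - trend[j]) for j in range(len(trend))]
```

Each step costs one addition and one subtraction whatever the extent of the average, which is why the simple moving average is so convenient for hand or machine work.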


Now let us try fitting a quadratic to consecutive sets of 7 points. The
appropriate formula is, from (26.10),

$$\tfrac{1}{21}[-2, 3, 6, 7, \dots]$$

This is not nearly so easy to apply as in our first case. We shall have,
for the initial term corresponding to t=4,

$$\tfrac{1}{21}\{(-2 \times 2203) + (3 \times 2360) + (6 \times 2254) + (7 \times 2165) + (6 \times 2024) + (3 \times 2078) - (2 \times 2214)\} = 2157$$

and a new calculation of this kind has to be done for each term of the
trend line. The process is straightforward but tedious. It may be
facilitated by the construction of a template which leaves only seven
consecutive terms exposed to view, so that the eye does not pick out the
wrong terms in machine calculations.
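The weighted calculation can be sketched in the same way. The helper name `weighted_moving_average` and its convention of passing the weights only up to the middle term (as in the bracket notation of 26.18) are illustrative assumptions.

```python
def weighted_moving_average(values, half_weights, divisor):
    """Apply symmetric weights given up to the middle term,
    e.g. [-2, 3, 6, 7] with divisor 21 for the 7-point quadratic."""
    # Mirror the weights about their middle value.
    weights = half_weights + half_weights[-2::-1]
    k = len(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + k])) / divisor
        for i in range(len(values) - k + 1)
    ]


# First ten terms of the series, from column (2) of Table 26.5.
u = [2203, 2360, 2254, 2165, 2024, 2078, 2214, 2292, 2207, 2119]
trend = weighted_moving_average(u, [-2, 3, 6, 7], 21)
```

The first entry of `trend` is the value 45303/21, which rounds to the 2157 obtained above for the fourth term.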

Fig. 26.5 (horizontal axis: number of terms)

We have shown in Figure 26.5 the result of applying this process to the
series of Table 26.1.

An examination of this diagram will reveal the conventional nature
of the determination of trend. The 7-point quadratic is not, for most
purposes, a good trend line because it follows the primary data too closely
and reproduces short term fluctuations. The fit is too good. The same
is true, though to a smaller extent, of the moving five-year average. On
the other hand the simple nine-year average seems to have the sort of
properties we require to describe the general trend. We might have
guessed this at the outset by noting that the major fluctuations seem to
cover a period of about six years on the average so that a moving average
of at least six successive terms is required to smooth them out. See also
Example 26.5.

Approximate formulae

26.20 By far the simplest kind of moving average to apply is the one in
which all weights are equal, and it is possible to simulate the accurate
formulae of (26.10) and (26.11) by repeated simple moving averages. For
instance if we apply a simple average of threes to a series we have a series
typified by $\frac{1}{3}(u_1+u_2+u_3)$; and if we apply a simple average of threes
to this series we have as a typical term

$$\tfrac{1}{3}\left\{\tfrac{1}{3}(u_1+u_2+u_3) + \tfrac{1}{3}(u_2+u_3+u_4) + \tfrac{1}{3}(u_3+u_4+u_5)\right\}
= \tfrac{1}{9}\{u_1+2u_2+3u_3+2u_4+u_5\}
= \tfrac{1}{9}[1, 2, 3, \dots] \tag{26.12}$$

The coefficients here follow more the pattern of (26.10) in that instead of
being equal they rise to a maximum at the middle member. We state
without proof that for many purposes great accuracy in the weights of
a moving average is not necessary, so that formulae of the kind of (26.12)
may be used as substitutes for the accurate formulae without serious loss
of efficiency.

Two formulae of general use in actuarial work are known as Spencer's
15-point and 21-point formulae. The weights are as follows:

Spencer's 15-point formula

Writing [k] for a simple moving sum of k terms, we have for this
formula

$$\tfrac{1}{320}[4]^2[5][-3, 3, 4, \dots] = \tfrac{1}{320}[-3, -6, -5, 3, 21, 46, 67, 74, \dots] \tag{26.13}$$

Spencer's 21-point formula

$$\tfrac{1}{350}[5]^2[7][-1, 0, 1, 2, \dots] = \tfrac{1}{350}[-1, -3, -5, -5, -2, 6, 18, 33, 47, 57, 60, \dots] \tag{26.14}$$

These are accurate as far as third differences, i.e. they reproduce a cubic
exactly and will provide a good approximation for higher order curves.
The advantage in using them lies in the fact that most of the arithmetic
can be carried out by simple summation. For instance, with (26.13) we
first of all find a moving sum of fours, then a second moving sum of fours
of the result, then a moving sum of fives of that result, and finally apply
the weighted moving sum of fives [-3, 3, 4, . . .] and divide by 320. This is
much more rapid than carrying out the moving average in one stage by the
weights given on the right hand side of (26.13).

The statistician will rarely require closer fits than are given by these
formulae and frequently even they are too good in the sense noted in
26.19. A simple moving average often gives him what he requires if his
series fluctuates; if it constantly moves in the same direction so as to
remain always concave or convex to the t-axis a simple moving average
will systematically under- or over-shoot the mark. Compare Exercise
26.14.
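That repeated simple sums build up these weight systems can be checked by convolving the separate stages. This is a sketch only; the helper names `ones` and `combine` are illustrative assumptions, and the divisors 9, 320 and 350 are applied afterwards as in the text.

```python
import numpy as np


def ones(k):
    """Weights of a simple moving sum of k terms."""
    return np.ones(k, dtype=int)


def combine(*stages):
    """Convolve successive sets of weights into one equivalent set."""
    result = np.array([1])
    for stage in stages:
        result = np.convolve(result, stage)
    return result


double_threes = combine(ones(3), ones(3))                            # divisor 9
spencer_15 = combine(ones(4), ones(4), ones(5), [-3, 3, 4, 3, -3])   # divisor 320
spencer_21 = combine(ones(5), ones(5), ones(7), [-1, 0, 1, 2, 1, 0, -1])  # divisor 350

# Reproducing a cubic exactly requires the second moment of the weights to
# vanish; the odd moments vanish already by symmetry.
t15 = np.arange(-7, 8)
second_moment_15 = int((spencer_15 * t15 ** 2).sum())
```

The arrays agree with the right hand sides of (26.12), (26.13) and (26.14), and `second_moment_15` is zero, consistent with the statement that the formula is accurate as far as third differences.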

26.21 We have chosen the number of points to which a polynomial is 
fitted to be odd. This is convenient because in the contrary case either 
the middle of the fitted range falls between two time-points or we have 
to fit a polynomial asymmetrically. Where, however, it is essential to 
fit to an even number of consecutive points we can easily do so by a slight 
modification of the technique. Consider the case of data given by quarters 
over a series of years. To eliminate seasonal effects the natural thing 
to do is to take a moving average of fours, but this gives us a set of values 
which do not correspond to the time-points of the original data. If, 
for instance, the information is an average over each quarter, the 
quarterly figures relate on the average to the middle of quarters and a four- 
point moving average will give values at the end of quarters. This may 
be adequate for our purposes. If not, we can centralise the trend 
values by taking a four-point moving average and then a simple mean 
(a two-point average) of the result. For instance, with a series starting 
with the first quarter of 1948, a four-point average will give figures relating 
to the end of June, the end of September, the end of December, 1948 and 
so on. A simple average of pairs of the result will give figures relating to 
the middle of August (the third quarter), the middle of November (the 
fourth quarter) and so on. In effect, what this process amounts to is the
replacement of the scheme $\frac{1}{4}[1, 1, 1, 1]$ by $\frac{1}{8}[1, 2, 2, \dots]$ as the reader can
readily verify. An example is given below (Example 26.2).
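The verification left to the reader can be sketched numerically. The quarterly pattern used below is made up purely for illustration and is not data from the text.

```python
import numpy as np

four_point = np.ones(4) / 4   # simple moving average of fours
pair_mean = np.ones(2) / 2    # simple mean of consecutive pairs
# Applying one after the other is equivalent to a single set of weights.
centred = np.convolve(four_point, pair_mean)

# A made-up, strictly periodic quarterly pattern with zero mean,
# repeated for three years.
pattern = np.tile([3.0, -1.0, -4.0, 2.0], 3)
smoothed = np.convolve(pattern, centred, mode="valid")
```

Here `centred * 8` gives the five weights 1, 2, 2, 2, 1, and `smoothed` is zero throughout: the centred average removes a strictly periodic quarterly effect while giving values that fall on the original time-points.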

Elimination of seasonal effects

26.22 A great many time series, particularly in economics and
meteorology, are affected by the seasons. Similarly, other natural
rhythms of shorter duration generate periodic effects such as the daily
rise and fall in temperature at a given spot or the variation in tides at a
port. Man-made periodicities may also appear, as in the change in the
nature of road traffic at week-ends, or the rise in current bank balances
at the end of the month. For simplicity we may term all such variations
"seasonal" where they correspond to identifiable and strictly periodic
rhythms in the causative system even though the period is not one year.
The student should beware of regarding an oscillatory movement as
"seasonal" (i.e. strictly periodic) merely because it presents some
appearance of regularity.

26.23 Our object in considering seasonal effects may be either to get rid
of them in order to concentrate on the remaining variation or to isolate
them for separate study. Elimination is a simple matter if we are prepared
to extend our time-interval to cover a complete period of the seasons.
For instance, we can eliminate any seasonal effect in records of sheep
population by observing that population at a fixed date each year. The
same stage of breeding and slaughtering may not quite be attained on the
given date in different years but variations from it will be small and erratic.
Again, we may eliminate seasonal movements in rainfall by recording
only the total occurring in each year, the resulting series of annual figures
containing no seasonal effects. Methods like these, of course, "eliminate"
seasonal movement only in the sense of choosing a longer time-interval
which covers one or more complete seasonal cycles; they do not record
for each part of the year what the value of the series would be if the
seasonal part of the movement were abstracted, and to that extent they
sacrifice information.

26.24 To fix the ideas, consider a series of monthly prices of a commodity
such as eggs. This series has a definite seasonal movement but also may
move from year to year independently of the purely seasonal effect. A
simple 12-point moving average is often sufficient to smooth out seasonal
variation but where enough data are available we may also take the
calculations further as in the following example.

Example 26.2

The average monthly prices per 120 eggs in England and Wales in 1927
and January 1928 were as follows:

(1927)          Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept. Oct.  Nov.  Dec.
Price (pence)   236   232   147   132   131   145   164   200   232   294   327   296

(1928)          Jan.
Price (pence)   286

The average of the prices for the 12 months of 1927 was 211 pence. The
monthly prices relate approximately to the middle of the month (being
averages covering the whole month) and this average over the year
therefore gives a range centred at the end of June. The average for the months
Feb. 1927-Jan. 1928 inclusive was 215 pence and this relates to a period
centred at the end of July. We therefore take as the appropriate value
for the middle of July the mean of 211 and 215, namely 213 pence. This
is the 12-month "centred" moving average or "trend-value" for July
1927.

The actual price for July was 164 pence and hence this price, as a
percentage of the "trend-value", is 16400/213 = 77.0. Calculations on
these lines for the years 1927-1936 are shown in Table 26.6.
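The centring step of this example can be sketched with the thirteen prices quoted above. The variable names are illustrative assumptions. Note that the sketch carries unrounded averages, whereas the text works with whole pence, so the final percentage differs slightly from the 77.0 quoted.

```python
# Monthly prices (pence per 120 eggs) quoted in Example 26.2.
prices_1927 = [236, 232, 147, 132, 131, 145, 164, 200, 232, 294, 327, 296]
price_jan_1928 = 286
series = prices_1927 + [price_jan_1928]

# Jan.-Dec. 1927: a 12-month average centred at the end of June.
first = sum(series[0:12]) / 12
# Feb. 1927-Jan. 1928: a 12-month average centred at the end of July.
second = sum(series[1:13]) / 12

# Mean of the two: the centred trend-value for the middle of July 1927.
trend_july = (first + second) / 2
# Actual July price as a percentage of the trend-value.
ratio_july = 100 * series[6] / trend_july
```

At full precision the two averages are about 211.3 and 215.5, the trend-value about 213.4 and the ratio about 76.8.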



TABLE 26.6. Percentage ratios of actual egg prices to 12-month moving averages and seasonal indices derived therefrom

Data from C. T. Houghton, J. R. Stat. Soc., 101, 275.

In column (11) of this table is shown the average of the monthly indices
for each month; and column (12) scales these figures down very slightly
so as to make them add up to 1200.0. The results may be regarded as
an index of the purely seasonal part of the egg prices. The January
figure, for instance, indicates that on the average over nine years the
January price was 109.3 per cent of the trend-value for January, or that
seasonally prices are increased by 9.3 per cent in that month.

Let us now return to the prices for January-December 1927 quoted at
the beginning of the example. These include an element due to the seasonal
effect. Suppose we wish to eliminate seasonality in order to study whether
there was any "real" change in the price of eggs over the year. We
then divide the January price by 1.093, the February price by 0.973 and
so on to obtain:

                  Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sep.  Oct.  Nov.  Dec.
Corrected price
(pence)           216   238   209   213   203   202   194   195   211   216   207   212

These may be regarded as the prices "corrected" for seasonality. The
movement over the course of the year, apart from seasonal effects, is
obviously slight.

Change in price-level 

26.25 As we have noted in connection with index-numbers special points 
arise when our series are expressed in terms of money owing to the change 
in the value of the unit over a long period. We may, therefore, wish to 
remove from a series of prices a trend in the general price-level. This is 
not the same thing as removing a trend in an ordinary series ; there we 
are concerned with long-term changes in the numbers of units, whereas 
here we are concerned with changes in the unit itself. The procedure 
customary in such cases is to divide the actual price by an index of general 
prices, or the price of gold, or some similar figure expressing the value of 
money ; alternatively we may revalue on the basis of prices in some 
standard year when our series relates to a "basket of goods". We have
noticed this latter process in Chapter 25. The former is illustrated in 
Table 26.7. Column (2) shows the net national income per head of pop- 
ulation in the United Kingdom. Column (3) gives an index of prices on 
the basis of 1900=100. These figures are used to " correct for price 
changes " or to eliminate trends in prices to give column (4) which thus 
provides figures for income per head of a more comparable kind. 

The effect of trend elimination on other elements

26.26 The success or failure of a method of determining trend is to be
judged by results so far as the trend itself is concerned; that is to say,
by whether it gives a sufficiently broad general picture of the movement
of the series for our purposes. But if our object is to eliminate trend in
order to study short-term movements in the series we have to be most
careful that the residuals do not reflect the nature of the trend fitting
rather than any intrinsic property of their own. In no branch of
statistics do we have to guard so much against projecting our pre-conceived
ideas into the data by the technique of analysis we adopt.

Example 26.3 

Figure 26.6 shows the residuals given by two of the three methods of
curve fitting derived in Example 26.1, the 9-point simple average and
the 7-point quadratic. (By residuals we mean the deviations of the actual
series from the trend values.) Evidently the magnitudes of the deviations
are very different in the two cases so that if we are interested in the size
of the residual fluctuation our result depends very much on which method
of trend-elimination we use. On the other hand, there seems to be a
regularity in the oscillatory movement which is common to both series
so that any judgment as to the period of the short-term movement would
probably be very much the same whichever method of eliminating trend
we had adopted.

26.27 Suppose that a series consists of the sum of three components, a 
trend, an oscillatory movement and a random element. Our method of 
trend elimination by moving averages evidently acts separately on these 
three components ; if, therefore, it eliminates the trend perfectly we 
shall be left with residuals which are the same as if we had applied the 
method to a series consisting of the sum of an oscillatory and a random 
component. Let us consider the effect of the method on such components. 


26.28 Consider an oscillation which is given by the terms of a sine-series

$$u_t = \sin\left(\alpha + \frac{2\pi t}{\lambda}\right) \tag{26.15}$$

where $\alpha$ and $\lambda$ are constants. Such a series gives a harmonic wave of
period $\lambda$. In most text-books of trigonometry it is proved that

$$\frac{1}{k}\sum_{j=0}^{k-1}\sin\left(\alpha + \frac{2\pi (t+j)}{\lambda}\right)
= \frac{1}{k}\,\frac{\sin(\pi k/\lambda)}{\sin(\pi/\lambda)}\,
\sin\left(\alpha + \frac{2\pi t}{\lambda} + \frac{(k-1)\pi}{\lambda}\right) \tag{26.16}$$

Thus a simple moving average of k terms will result in a sine-series with
the same period as the primary series but with amplitude reduced by
the factor

$$\frac{1}{k}\,\frac{\sin(\pi k/\lambda)}{\sin(\pi/\lambda)} \tag{26.17}$$

If the process is repeated q times the amplitude is reduced by the qth
power of this quantity.

If then k is large or $\pi k/\lambda$ is an integral multiple of $\pi$, the expression
(26.17) is zero or small. Thus the "trend" determined in the oscillation
is small and the residual only slightly affected. But if $\lambda$ is large and $k/\lambda$
is small the term (26.17) is nearly unity (since $\sin\theta = \theta$ approximately
for small $\theta$) and hence the residual will be very small, most of the primary
variation being eliminated as trend.

26.29 This is what we might expect on general grounds. If $k/\lambda$ is small
and $\lambda$ is large the oscillation has a large period, i.e. is a very slow one and
is treated as trend by the moving average. If the period is short compared
with k the residuals are only slightly affected.

In general, we may expect from this analysis that a moving average
will emphasise the shorter oscillations at the expense of the longer ones.
It is interesting to note that in some circumstances (26.17) may be negative
so that the oscillation in the residual may be even larger than in the
primary series.
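The behaviour of the factor (26.17) can be explored numerically. In this sketch the function names and the particular values of k and the period are illustrative assumptions; the second function simply averages a sampled sine wave and measures what is left, as a check on the formula.

```python
import math


def amplitude_factor(k, lam):
    """The factor (1/k) * sin(pi*k/lam) / sin(pi/lam) of (26.17)."""
    return math.sin(math.pi * k / lam) / (k * math.sin(math.pi / lam))


def averaged_amplitude(k, lam, n=2000):
    """Largest absolute value of a k-term simple moving average
    of the sine-series sin(2*pi*t/lam), t = 0, ..., n-1."""
    u = [math.sin(2 * math.pi * t / lam) for t in range(n)]
    averaged = [sum(u[i:i + k]) / k for i in range(n - k + 1)]
    return max(abs(x) for x in averaged)


slow = amplitude_factor(5, 100)      # long period: nearly all taken as trend
matched = amplitude_factor(5, 5)     # extent equal to the period: removed
reversed_sign = amplitude_factor(5, 4)  # a case where the factor is negative
```

With an extent of 5, a wave of period 100 passes into the trend almost intact, a wave of period 5 is annihilated, and a wave of period 4 comes through with its sign reversed, so that subtracting the trend enlarges it in the residual.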

26.30 Consider, again, the effect of a moving average on a random
series with zero mean. To fix the ideas, consider a moving average
of fives. Two consecutive values of the trend would be typified by

$$\tfrac{1}{5}(\epsilon_1+\epsilon_2+\epsilon_3+\epsilon_4+\epsilon_5), \qquad \tfrac{1}{5}(\epsilon_2+\epsilon_3+\epsilon_4+\epsilon_5+\epsilon_6) \tag{26.18}$$

The variance of this series is $\frac{1}{5}\operatorname{var}\epsilon$ and the covariance (since the residuals
are independent) is

$$\tfrac{1}{25}E(\epsilon_2^2+\epsilon_3^2+\epsilon_4^2+\epsilon_5^2) = \tfrac{4}{25}\operatorname{var}\epsilon$$

Thus the correlation between neighbouring terms is 4/5. Similarly the
correlation between terms 1, 2, 3, 4 members apart is 3/5, 2/5, 1/5, 0.
Hence the values of the "trend" will tend to be smooth; and when we
subtract the trend from the original series we shall get a smooth component
on which is superposed a random series. The effect of trend elimination
is therefore to insert in the residuals a smooth component which, in
general, will exhibit oscillations. We have to take care, accordingly,
that when we detect "oscillations" in a series from which trend has
been eliminated by moving averages, the oscillations are not spurious.
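The correlations quoted here follow from the overlap of the weights alone, and a short sketch makes this explicit. The function name `trend_autocorrelation` is an illustrative assumption; the calculation supposes, as in the text, that the random terms are independent with equal variance.

```python
import numpy as np


def trend_autocorrelation(weights):
    """Correlations, for lags 0, 1, 2, ..., between values of a moving
    average applied to independent random terms of equal variance."""
    w = np.asarray(weights, dtype=float)
    # The covariance at lag s is var(e) times the sum of w[i] * w[i + s];
    # the second half of the full auto-correlation of w gives lags 0, 1, ...
    cov = np.correlate(w, w, mode="full")[len(w) - 1:]
    return cov / cov[0]


rho = trend_autocorrelation(np.ones(5) / 5)
```

For the moving average of fives `rho` is 1, 4/5, 3/5, 2/5, 1/5 at lags 0 to 4, and the correlation vanishes from lag 5 onwards because the two sets of terms no longer overlap.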

Example 26.4 

Figure 26.7 shows the results of a smoothing $\frac{1}{15}[5][3]$ on a set of
random numbers which could vary from 0 to 19 inclusive. They were
obtained from the numbers on page 376 by reading two figure numbers
downwards and omitting multiples of 20, e.g. the first numbers are 9, 1,
3, 5, 7. The resemblance to the vague fluctuation of a trade cycle is
evident.

Fig. 26.7. Smoothing by a $\frac{1}{15}[5][3]$ average of a random series (horizontal axis: number of terms)


Variate-differencing

26.31 As in the case of curve fitting (Chapter 15) the reader may wonder
how he is to find out in any particular case what sort of moving average
to use. If he is interested in trend the answer is as indicated in Example
26.1. But if he is interested in residuals the answer is much more difficult.
We will indicate in broad outline a method which has as its object the
detection of random variation $\epsilon$ and the estimation of its variance and
which indicates at any rate an upper limit to the degree of the trend line.

Suppose a series consists of a polynomial of degree r plus a random
element. Then if we take first, second, third differences etc., the resulting
series consists of a polynomial of degree r-1, r-2, etc., plus a residual
which increases in variance. We have, for instance, after the manner
of 24.15

$$E(\Delta\epsilon_t) = E(\epsilon_{t+1} - \epsilon_t) = 0 \tag{26.19}$$

$$\operatorname{var}(\Delta\epsilon_t) = E(\epsilon_{t+1} - \epsilon_t)^2 = 2\operatorname{var}\epsilon \tag{26.20}$$

Similarly

$$\operatorname{var}(\Delta^2\epsilon_t) = 6\operatorname{var}\epsilon \tag{26.21}$$

and generally

$$\operatorname{var}(\Delta^r\epsilon_t) = \binom{2r}{r}\operatorname{var}\epsilon \tag{26.22}$$

The effect of differencing is then to enhance the short-term movements
at the expense of the long term movements and in particular to multiply
the purely random element until it swamps all the others. (There is an
exception to this rule if the systematic part of the series has a short
period of two or less, for this is not reduced by differencing, as may be seen
by considering the series 1, -1, 1, -1, etc.) This gives us a method of
estimating the variance of a random element superposed on a series
which can be represented (perhaps only locally) by a polynomial. The
variances of the first, second, . . . rth difference (or better, the second
moments about zero origin) are divided respectively by 2, 6, . . . $\binom{2r}{r}$
and if this quotient seems to be approaching a limit, the limiting value
provides an estimate of var $\epsilon$. Further, the degree to which we have had
to go is some indication of the degree of the systematic part of the curve.
Example 26.5

Consider again the sheep data of Table 26.1. A
calculation of the differences would proceed as follows:

u_t           2203    2360    2254    2165    2024    etc.
$\Delta$              -157     106      89     141        etc.
$\Delta^2$                -263      17     -52
$\Delta^3$                    -280      69

The sums of squares of the differences $\Delta^r$ are shown in the following
table. Column (3) shows the number N of terms on which they are
based, and column (4) the ratio $\Sigma(\Delta^r u)^2 \big/ N\binom{2r}{r}$, that is to say the ratio
which we expect to tend to the variance of $\epsilon$.


TABLE 26.8. — VariateHUBrnnce aaalydf of tbe data of TaUe 26.1 


(1) 

Order of diderence 
r 

(2) 

Sum of squares of 1 
A' 

(3) 1 

1 Number of terms 
in sum N 

(4) 

Column (2) 

1 

499,356 i 

72 

3468 

2 

614,333 

71 

1442 

1 3 

1.195,999 

70 

854 

4 1 

3.037.326 

69 

629 

5 1 

8.883.670 

68 

518 

6 

27.735.006 

67 

448 

7 

90,957.010 

66 

402 

» 

I 310,670.360 

65 

371 

e 

i 1.110,091.780 

64 

357 

10 i 

; 4.043.696,988 

1 63 

1 

347 
We also find, for the original series

    \[ \Sigma(u_t) = 135,537, \qquad \Sigma(u_t^2) = 267,800,918, \]

whence we have for its variance

    \[ \mu_2 = 272,229. \]

A comparison of this figure with the fourth column of Table 26.8 shows
that the variation is very substantially reduced by the first two or three
differencings. We should be justified in concluding that the data can be
represented locally by a polynomial of the third or fourth order, e.g. by
a moving cubic or quartic, and that the error $\epsilon$ (regarded as superposed
on this systematic representation) has a variance of about 500.

What we have said above about the adequacy of a trend line is in
no way affected by this result. The present example tells us that if the
data consist of a polynomial plus a random element, there is no need
to seek for a polynomial of degree higher than four. It indicates that
we should be wasting our time in trying to fit quintic or higher order
curves (or in using moving averages based on quintics, etc.). It does
not say that a quartic is the best trend line for the purposes of a broad
description of the trend; a simple curve might be more suitable in
particular circumstances.


SUMMARY 

1. For descriptive purposes the most general form of univariate time 
series may be regarded as composed of trend, short-term systematic 
movement and random or haphazard components. 

2. This analysis sometimes corresponds to different causative systems, 
but not always so. 

3. A convenient method of trend determination is to use moving 
averages. The weights can be determined by least squares and approxi- 
mations to the exact weights are legitimate and useful. 

4. Seasonal effects, i.e. movements occurring in a strictly periodic 
manner, can be removed or isolated by a special method. 

5. Moving averages may distort short-term components and generate
spurious oscillatory movements in random components of a time-series.

6. Variate-differencing can be used to estimate the variance of the
random component of a series on the assumption that the other components
can be represented (at least locally) by a polynomial in time and that
no periodic movement is present with a period of two intervals or less.


EXERCISES 

26.1 Determine a trend line by a simple moving average of nines in
the data of Table 26.1 for the years 1905 to 1939.

26.2 The values of a series $u_1, \ldots, u_n$ are plotted on a diagram in the usual
way with t as abscissa. The points corresponding to $u_1$ and $u_2$ are joined
and the line joining them bisected, giving an ordinate of, say, $v_1$. The
process is repeated by bisecting the line joining $u_2$ and $u_3$ to give $v_2$;
and so on along the series.

The procedure is repeated with the series $v_1, v_2, \ldots$ to give a series
$w_1, w_2, \ldots$ Show that $w_t = \frac{1}{4}(u_t + 2u_{t+1} + u_{t+2})$. Examine the suitability
of this procedure as a method of determining a trend line in the data of
Table 26.2.

26.3 The following are the figures for the infantile mortality rate in
England and Wales (deaths of infants under one year of age per 1,000
live births):

    Year    Rate        Year    Rate

    1922     77         1935     57
    1923     69         1936     59
    1924     75         1937     58
    1925     75         1938     53
    1926     70         1939     51
    1927     70         1940     57
    1928     65         1941     60
    1929     74         1942     61
    1930     60         1943     49
    1931     66         1944     45
    1932     65         1945     46
    1933     64         1946     43
    1934     59

Fit a simple moving average of fives to this series and apply a further
simple moving average of fives to the result.

26.4 The following is the rainfall in inches in England and Wales for
certain months:

    Month     Average       1943     1944     1945     1946
              1881-1915

    Jan.        2.99         6.2      3.3      3.4
    Feb.        2.57         1.8      1.7      3.1
    Mar.        2.67         0.9      0.5      1.3
    Apr.        2.12         1.4      2.2      1.7
    May         2.30         3.2      1.5      3.2      3.0
    June        2.44         2.3      2.5      3.2      3.4
    July        2.87         2.2      2.8      2.6      3.1
    Aug.        3.35         3.2      3.2      2.8      5.4
    Sept.       2.54         3.4      4.2      2.5      4.9
    Oct.        3.97         3.4      4.5      4.1      1.5
    Nov.        3.49         2.7      6.1      0.8
    Dec.        3.92         2.1      2.8      4.1      4.0

    Total      35.23        32.8     35.3     32.8     41.8













Using the average of the period 1881-1915 as a norm derive monthly
index numbers, for the period 1943-6, of the rainfall "corrected" for
seasonality. Graph your results.

26.5 If the smoothing formula [—2, 3, 6, 7, . . . ] is applied to a 
random series, find the correlations between members of the smoothed 
series 0, 1, 2, 3, 4, 5, 6 members apart. 

26.6 Construct ten terms of the series whose value at time t is $t^3 - 2t + 5$
for t = 0, 1, ... 9. Verify that the formula

    \[ \tfrac{1}{35}\,[-3, 12, 17, \ldots] \]

gives an exact fit to such a series.

26.7 Take the random digits of 16.30 as random numbers which can 
vary from 0 to 9 with equal frequency in the long run. Take a simple 
moving average of threes of the first 50 terms, then a simple moving 
average of five of the resultant, then another simple moving average of 
five of that resultant. Note the appearance of smooth series from the
repeated averaging.

Write down the coefficients of the smoothing process if carried out in a 
single stage. 

26.8 The following is an index number of the price of lead from 1926
to 1945 together with the "Statist" wholesale price index for the period.
Construct an index of lead prices "corrected" for changes in the wholesale
price level.

            Index numbers                   Index numbers
    Year    Wholesale    Lead       Year    Wholesale    Lead
            prices                          prices

    1926      125        157        1936       88          95
    1927      122        125        1937      102         121
    1928      119        109        1938       90         813
    1929      114        117        1939       94          85
    1930       96         95        1940      128         127
    1931       82         71        1941      142         129
    1932       79         63        1942      151         129
    1933       78         65        1943      15$         129
    1934       81         61        1944      160         129
    1935       83         78        1945      164         142

26.9 For a series in which the values are represented by a cube or lower
power of the time variate t show that, if $\frac{1}{k}[k]$ is written in brief for a
simple moving average of k terms,

    \[ \frac{1}{h^2 - k^2}\left\{ \frac{h^2 - 1}{k}\,[k] - \frac{k^2 - 1}{h}\,[h] \right\} \]

gives an accurate trend line. Hence show how, by two simple moving
averages, we may obtain a trend formula which will be correct to the
third degree in the fitted polynomial.

Obtain the formula when h = 5, k = 3 in the form

    \[ \tfrac{1}{10}\,[-1, 4, 4, \ldots] \]


26.10 By considering the series $(t-2)^3, (t-1)^3, \ldots, (t+2)^3$ show that
the formula

    \[ [\,b,\ -4b,\ 1 + 6b,\ -4b,\ b\,] \]

accurately reproduces a cubic curve for any value of b. Show further
that if this formula is applied to a random series the correlation between
neighbouring members in the resultant "trend" is

    \[ -8b(1 + 7b) / (70b^2 + 12b + 1). \]

26.11 The following are the quarterly index numbers of wholesale prices
in the U.K. published by the "Statist".

                     Quarter
    Year      1       2       3       4

    1928     122     125     118     117
    1929     119     114     114     109
    1930     105      99      93      f9
    1931      86      80      83      84
    1932      85      80      80      78
    1933      77      80      81      80
    1934      82      81      83      82
    1935      83      84      85      88

By a "centred" moving average of four calculate a quarterly index
corrected for seasonal effects.

26.12 If $\delta$ is the "central" difference defined by

    \[ \delta u_t = u_{t+1/2} - u_{t-1/2} \]

show that to third differences,

where $\frac{1}{k}[k]$ stands for a simple moving average of k.

26.13 Verify equations (26.10), and show generally that the same
formulae are reached for polynomials of order 2p + 1 as for order 2p.


26.14 The value $u_t$ at time t is given by $u_t = \sqrt{t/10}$. Sketch the series
from t = 0 to t = 100 and show that the "trend" determined by a simple
moving average is always less than the actual value of the series.

27.1 In this chapter we shall consider the short-term and random
components in time-series, and shall suppose either that our series have
no trend present (as in Tables 26.2 and 26.3) or that, if trend was originally
present, it has been removed. Our series will then fluctuate more or less
irregularly about some central value which we may regard as the mean
of the whole series; and our problems are to detect and to investigate
the nature of the components of such fluctuation.

Tests for randomness 

27.2 Let us first consider what kind of series we are likely to obtain if 
the variation is entirely random, i.e. if successive values are independent 
and the series may be considered as the chance arrangement of a sample 
from some unknown population. Two features suggest themselves as 
natural measures of departure from this situation, (a) the occurrence of
peaks and troughs in the series and (b) the correlations between
neighbouring members.

27.3 A member of a series $u_t$ is said to be a "peak" if $u_{t-1} < u_t > u_{t+1}$
and it is a "trough" if $u_{t-1} > u_t < u_{t+1}$. In either case it is a "turning-
point" and the interval between turning points is called a "phase".
If two or more successive values are the same and are greater than
neighbouring values we regard them as determining one peak situated in the
centre of the range of equal values; and so for troughs.

It may be shown that in a random series of n terms the mean and
variance of the number of turning points p are given by

    \[ E(p) = \tfrac{2}{3}(n - 2) \tag{27.1} \]

    \[ \operatorname{var} p = \frac{16n - 29}{90} \tag{27.2} \]

These results are independent of the distribution of the parent population
of values of the series and therefore have a considerable generality. As
n becomes large the distribution of p tends to normality fairly quickly.

For large n the average number of turning points per unit interval is
2/3 and the average phase (the average distance between such points)
is therefore 1.5. Hence the average distance between peaks (or between
troughs) is 3, and this is what we expect to find in a random series.

Example 27.1

Consider the data of Table 26.2. If n = 19 we have, from (27.1) and
(27.2), a mean value of 11.3 and a variance 3.05 for p. The actual number
of turning points in the table is 9. The deviation from the mean, 2.3, is
less than twice the standard deviation of about 1.75 and we conclude that
this evidence is not significant of departures of the series from randomness.

On the other hand in Table 26.4 where n = 48 the mean and variance of
p are 30.67 and 8.21. The observed value of p is 14 which differs from
the mean by more than six times the standard deviation. We cannot
therefore regard the series as random.
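
As a rough illustration of the turning-point test, the following Python sketch counts turning points and compares the count with the mean and variance for a random series, taken here as 2(n - 2)/3 and (16n - 29)/90 (these reproduce the values 11.3, 3.05 and 30.67, 8.21 quoted in the example). The function names are hypothetical, and a run of equal values is collapsed to a single value so that a plateau counts as one peak or trough.

```python
from math import sqrt


def count_turning_points(series):
    """Count peaks and troughs; a plateau of equal values counts once."""
    # Collapse runs of equal neighbours into a single value.
    collapsed = []
    for value in series:
        if not collapsed or value != collapsed[-1]:
            collapsed.append(value)
    count = 0
    for prev, cur, nxt in zip(collapsed, collapsed[1:], collapsed[2:]):
        if (prev < cur > nxt) or (prev > cur < nxt):
            count += 1
    return count


def turning_point_moments(n):
    """Mean and variance of the number of turning points in a random series."""
    mean = 2 * (n - 2) / 3
    variance = (16 * n - 29) / 90
    return mean, variance


def turning_point_z(observed, n):
    """Deviation of the observed count from its mean, in standard deviations."""
    mean, variance = turning_point_moments(n)
    return (observed - mean) / sqrt(variance)


print(turning_point_moments(19))   # moments for a series of 19 terms
print(turning_point_z(9, 19))      # 9 observed turning points, n = 19
```

An absolute deviation of less than about two standard deviations would, as in the example, not be taken as evidence against randomness.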

Serial correlation 

27.4 The coefficient of product-moment correlation between the
neighbouring members of a series is called the autocorrelation of order 1; and
similarly the correlation between members separated by (k - 1) others,
that is, k intervals apart, is called the autocorrelation of order k. Thus

    \[ \rho_k = \frac{\operatorname{cov}(u_t, u_{t+k})}{\sqrt{\operatorname{var} u_t \, \operatorname{var} u_{t+k}}} \tag{27.3} \]

These functions are very important in the theory of oscillatory time-series
and have applications far beyond the purpose for which we are now going
to use them. Where it is important to distinguish between the values
derived from a parent series and those from a sample we shall call the
latter serial correlations and denote them by $r_k$. The contrast between
auto and serial (of Greek and Latin origin), as between $\rho$ and r, accords
with our usual practice of denoting parent values by Greek and sample
values by Latin symbols.

This usage is not universal. Some writers use "autocorrelation" to
denote the correlation of members of a series among themselves, whether
in population or in sample, and "serial" correlation to denote the
correlations between different series.


27.5 In a long series var $u_t$ and var $u_{t+k}$ are practically identical and
(27.3) becomes

    \[ \rho_k = \frac{\operatorname{cov}(u_t, u_{t+k})}{\operatorname{var} u_t} \tag{27.4} \]

For short observed series it is better to take the variance of the whole
series (calculated from n terms) as the estimate of var u although the
covariance is based on only n - k terms. Similarly it is better to calculate
the deviations of u from the mean of the whole series in determining the
product-sum of $u_t$ and $u_{t+k}$. Then, if the members of the series are
measured about the mean of the whole set of terms, we have

    \[ r_k = \frac{\dfrac{1}{n-k} \sum_{t=1}^{n-k} u_t u_{t+k}}{\dfrac{1}{n} \sum_{t=1}^{n} u_t^2} \tag{27.5} \]

Now if a series is random the theoretical value of $\rho_k$ is zero for any k
other than k = 0. We may therefore use the departure of the serial
correlations from zero to test departure of the series from randomness.
We state without proof that for large n the variance of $r_k$ in a random
series is approximately

    \[ \operatorname{var} r_k = \frac{1}{n - k} \tag{27.6} \]
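
The calculation of a serial correlation in the manner of 27.5 can be sketched as follows. This is a minimal illustration, not a definitive routine: the function names are hypothetical, deviations are taken from the mean of the whole series, the product-sum is divided by n - k and the sum of squares by n, and the standard error for a random series is taken as the square root of 1/(n - k).

```python
from math import sqrt


def serial_correlation(series, k):
    """Serial correlation of order k, with deviations from the overall mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    # Covariance from the n - k available products.
    cov = sum(dev[t] * dev[t + k] for t in range(n - k)) / (n - k)
    # Variance from all n terms of the series.
    var = sum(d * d for d in dev) / n
    return cov / var


def random_series_standard_error(n, k):
    """Approximate standard error of r_k in a random series of n terms."""
    return 1 / sqrt(n - k)


print(random_series_standard_error(65, 1))  # 1 / sqrt(64)
```

A serial correlation several times larger than this standard error would be taken as evidence that the series is not random.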

Example 27.2

Table 27.1 shows the values of the residuals of the sheep series of
Table 26.1 when trend has been eliminated by a simple moving average
of nines.

The value of $r_1$ for this series of 65 terms is 0.595. The standard error
in a random series, from (27.6), is $1/\sqrt{64} = 0.125$. The observed value
is therefore significant and we conclude that the residual series cannot
be regarded as a random one.

TABLE 27.1. Residual values of the sheep series of Table 26.1 after elimination of
trend by a simple nine-point moving average

    Year    Residual      Year    Residual      Year    Residual
            (10,000)              (10,000)              (10,000)

    1871     -176         1893     + 34         1915     + 19
    1872     -112         1894     -103         1916     +128
    1873     + 50         1895     -104         1917     + 97
    1874     +141         1896     - 15         1918     + 69
    1875     + 60         1897     - 23         1919     - 29
    1876     - 20         1898     + 17         1920     -174
    1877     + 12         1899     + 71         1921     -107
    1878     + 82         1900     + 35         1922     -142
    1879     +130         1901     + 16         1923     -109
    1880     - 14         1902     - 27         1924     - 23
    1881     -166         1903     - 32         1925     + 60
    1882     -179         1904     - 49         1926     +121
    1883     - 84         1905     - 61         1927     + 94
    1884     + 38         1906     - 52         1928     - 25
    1885     + 97         1907     - 24         1929     - 90
    1886     +  8         1908     + 68         1930     - 75
    1887     -  5         1909     +141         1931     + 72
    1888     -105         1910     +119         1932     +152
    1889     - 99         1911     + 66         1933     +112
    1890     + 35         1912     - 52         1934     - 64
    1891     +159         1913     -117         1935     - 87
    1892     +167         1914     - 61


The calculation of serial correlation is rather a tedious process but
the coefficients may be obtained by the following device. The series of n terms
is written down vertically on each of two slips of paper, the spacing being
equidistant on the two slips. This can very conveniently be done on a tabulator
with a split keyboard. To calculate the first product-sum we place the
slips so that the first term on the right-hand slip is opposite the second on
the left-hand slip and so on all the way down. For most series the
difference of two terms which are opposite can be obtained mentally by
subtraction, squared and set up on an adding machine. The sum of
squares of differences is thus determined and the cross product
derived from the simple identity of the type

    \[ 2\Sigma(xy) = \Sigma(x^2) + \Sigma(y^2) - \Sigma(x - y)^2 \]

with the aid of which $r_k$ is obtained without difficulty.
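
The device can be mimicked in code. The sketch below (function names hypothetical) forms the lagged product-sum from sums of squares and the sum of squared differences, using the identity 2Σ(xy) = Σ(x²) + Σ(y²) - Σ(x - y)², where x is the series and y the same series shifted by k places.

```python
def product_sum_from_differences(x, y):
    """Sum of x*y obtained from squares and squared differences only."""
    sum_sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sum_sq_x = sum(a * a for a in x)
    sum_sq_y = sum(b * b for b in y)
    return (sum_sq_x + sum_sq_y - sum_sq_diff) / 2


def lagged_product_sum(series, k):
    """Product-sum of terms k apart (k >= 1), by the slip-of-paper device."""
    left = series[:len(series) - k]   # terms u_1 ... u_{n-k}
    right = series[k:]                # terms u_{1+k} ... u_n
    return product_sum_from_differences(left, right)


print(lagged_product_sum([3, 1, 4, 1, 5], 1))
```

On a modern machine the direct product-sum is just as easy; the identity matters only when squaring is cheaper than multiplying, as it was with an adding machine.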

27.7 Tests of the randomness of a time-series are often unnecessary 
because it is obvious from inspection that the series is systematic to some 
extent. The two tests we have given, however, may be applied when 
there is any doubt and will usually be sufficient to settle it. Suppose 
now that we have decided that our series is not random. Some part 
at least of the oscillatory movement then requires explanation. To set 
up models which will reproduce the behaviour of oscillatory series is one 
of the most difficult outstanding problems of current statistical theory 
and it would be quite beyond the scope of this book to give an account 
of even what is now known, incomplete though that is. What we shall
do is to describe and illustrate two techniques, one classical and one new,
which offer the most promise.

Periodogram analysis 

27.8 The reader who has an acquaintance with elementary physics is
probably familiar with the way in which the motion of many oscillatory
physical phenomena (tides, violin strings, pendulums and so forth) can
be represented as the sum of a number of "pure" harmonic waves each
of which can be represented by a sine or cosine term. The motion of a
pure oscillator in time is expressible as a term $A \sin(2\pi t/\lambda)$ where $\lambda$ is
the wavelength and A the amplitude; and oscillatory phenomena can
often be represented by a sum of such terms:

    \[ A_1 \sin\left(\alpha_1 + \frac{2\pi t}{\lambda_1}\right) + A_2 \sin\left(\alpha_2 + \frac{2\pi t}{\lambda_2}\right) + \cdots \tag{27.7} \]

Light itself is a phenomenon of this kind and Newton's classical experiment
with a prism in splitting white light into a spectrum may be regarded as
an analysis of a complicated periodic phenomenon into simple terms
each with its own "colour" or wavelength.

27.9 Aware that many physical phenomena can be described by series
of type (27.7), early investigators of economic and meteorological
time-series were led to inquire whether the same methods could be used to
describe them. The basic idea was that the series could be regarded as the
sum of a number of strictly periodic terms plus, perhaps, an error of
observation. This search for strict periodicity has not been very successful.
The model on which it is based requires that, apart from casual errors, the
peaks and troughs shall recur at equal intervals whereas in economic
series at least crises certainly do not recur with strict regularity.
Furthermore, the model presupposes that "errors" behave like errors of
observation, that is to say, that they occur to disturb the observation at a particular
moment but do not affect the subsequent motion of the system. Now
in economics and meteorology, at least, it is more plausible to suppose
that when something happens to disturb the system, the effect of that
disturbance is integrated into the future motion of the system and becomes
part of it. The model of superposed harmonics is not therefore a very
plausible one. Nevertheless there are branches of our subject where
analysis into harmonic components (i.e. sine or cosine terms) is useful
and this chapter would be incomplete without some reference to it.

27.10 The process of searching for the periodicities in a time-series by
harmonic analysis can be compared to the tuning of a radio set. We
correlate a number of series with known wavelengths with the given
series and if they are "out of step" with the wavelength of the series
the result is a low intensity; but when we come into tune with that
wavelength, there is a high intensity of correlation; and hence by
considering the various intensities we can discover whereabouts the true
wavelength lies.

To put it more accurately we select a trial wavelength $\mu$ and form the
sums

    \[ A = \frac{2}{n} \sum_{j=1}^{n} u_j \cos\frac{2\pi j}{\mu} \tag{27.8} \]

    \[ B = \frac{2}{n} \sum_{j=1}^{n} u_j \sin\frac{2\pi j}{\mu} \tag{27.9} \]

and write

    \[ S^2 = A^2 + B^2 \tag{27.10} \]

Then S is known as the intensity. Apart from constants the numbers
A and B are the covariances of the series with the "trial" sine and cosine
terms.

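Now suppose that the series is in fact given by

    \[ u_t = a \sin\frac{2\pi t}{\lambda} + b_t \]

where $b_t$ is a term uncorrelated with the trial period. Then

    \[ A = \frac{2a}{n} \sum_{j=1}^{n} \sin\frac{2\pi j}{\lambda} \cos\frac{2\pi j}{\mu}
         = \frac{2a}{n} \sum_{j=1}^{n} \sin\alpha j \, \cos\beta j \tag{27.11} \]

where $\alpha = 2\pi/\lambda$, $\beta = 2\pi/\mu$,

    \[ = \frac{a}{n} \sum_{j=1}^{n} \{ \sin(\alpha - \beta)j + \sin(\alpha + \beta)j \} \]

    \[ = \frac{a}{n} \left[ \frac{\sin\{\tfrac12(\alpha - \beta)n\} \sin\{\tfrac12(\alpha - \beta)(n+1)\}}{\sin\{\tfrac12(\alpha - \beta)\}}
       + \frac{\sin\{\tfrac12(\alpha + \beta)n\} \sin\{\tfrac12(\alpha + \beta)(n+1)\}}{\sin\{\tfrac12(\alpha + \beta)\}} \right] \tag{27.12} \]

with a similar expression for B in which $\sin\{\tfrac12(\alpha - \beta)(n + 1)\}$ is
replaced by a cosine. Now for large n this is small unless the term in
square brackets is large, that is, unless $\alpha - \beta$ or $\alpha + \beta$ is small. In
this case, neglecting the term of order 1/n, we find

    \[ A^2 + B^2 = \frac{a^2}{n^2} \, \frac{\sin^2\{\tfrac12(\alpha - \beta)n\}}{\sin^2\{\tfrac12(\alpha - \beta)\}} \]

and since, for small $\theta$, $\sin\theta = \theta$ approximately, $\sin\{\tfrac12(\alpha - \beta)n\} = \tfrac12(\alpha - \beta)n$
and we have

    \[ A^2 + B^2 = a^2 \tag{27.13} \]

Thus S remains small unless $\alpha$ is nearly equal to $\beta$ (and hence the trial
period $\mu$ is near to the real period $\lambda$) in which case S is equal to the constant
a and gives the amplitude of the term.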
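
The "tuning" behaviour of the intensity can be checked numerically. The sketch below is an illustration only: the function name `intensity` is hypothetical, and A and B are assumed to carry the factor 2/n, which is the scaling under which S equals the amplitude a when the trial period coincides with the true one.

```python
from math import cos, sin, pi, sqrt


def intensity(series, trial_period):
    """Intensity S = sqrt(A^2 + B^2) for one trial period."""
    n = len(series)
    # Terms are numbered j = 1, 2, ..., n.
    a_sum = sum(u * cos(2 * pi * j / trial_period)
                for j, u in enumerate(series, start=1))
    b_sum = sum(u * sin(2 * pi * j / trial_period)
                for j, u in enumerate(series, start=1))
    A = 2 * a_sum / n
    B = 2 * b_sum / n
    return sqrt(A * A + B * B)


# A pure harmonic of amplitude 3 and period 12, observed for 240 terms.
wave = [3 * sin(2 * pi * t / 12) for t in range(1, 241)]
for period in (7, 10, 11, 12, 13, 14, 20):
    print(period, round(intensity(wave, period), 3))
```

Plotting the printed intensities against the trial period gives a crude periodogram with a single sharp peak at the true period.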


27.11 To calculate the sums A and B, suppose in the first place that $\mu$
is an integer. Write down the series in rows of $\mu$ thus:

    \[ \begin{array}{lllll}
         & u_1 & u_2 & \cdots & u_\mu \\
         & u_{\mu+1} & u_{\mu+2} & \cdots & u_{2\mu} \\
         & \cdots & \cdots & \cdots & \cdots \\
         & u_{(p-1)\mu+1} & u_{(p-1)\mu+2} & \cdots & u_{p\mu} \\
       \text{Totals} & m_1 & m_2 & \cdots & m_\mu
       \end{array} \tag{27.14} \]

We continue writing down the rows until there are fewer than $\mu$ terms
left, the extra terms being neglected. The number $p\mu$ is then as near as
we can get to n in multiples of $\mu$ and may be denoted by N.

The sum

    \[ \frac{2}{N} \left\{ m_1 \cos\frac{2\pi}{\mu} + m_2 \cos\frac{4\pi}{\mu} + \cdots + m_\mu \cos\frac{2\pi\mu}{\mu} \right\} \tag{27.15} \]

is then the sum A of (27.8) for N terms. Similarly we have a formula
for B with sines instead of cosines.

In practice, of course, we do not actually form such a table as (27.14).
The sums m may be formed direct from the series on an adding machine
by adding every $\mu$th member, starting in turn at $u_1$, $u_2$, and so on.
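
The short-cut of 27.11 amounts to summing every μth member before multiplying by the cosine or sine. A minimal Python sketch (hypothetical function names; integer trial period assumed, and surplus terms at the end of the series neglected as in the text):

```python
from math import cos, sin, pi


def column_totals(series, period):
    """Totals m_1 ... m_period over complete rows only; also returns N."""
    rows = len(series) // period
    usable = series[:rows * period]        # neglect the incomplete last row
    totals = [sum(usable[c::period]) for c in range(period)]
    return totals, rows * period


def sums_from_totals(series, period):
    """A and B for N terms, computed from the column totals."""
    totals, n_used = column_totals(series, period)
    A = 2 / n_used * sum(m * cos(2 * pi * (c + 1) / period)
                         for c, m in enumerate(totals))
    B = 2 / n_used * sum(m * sin(2 * pi * (c + 1) / period)
                         for c, m in enumerate(totals))
    return A, B


data = list(range(1, 24))                  # 23 terms; only the first 20 are used
print(column_totals(data, 5))
print(sums_from_totals(data, 5))
```

Because the cosine and sine repeat every μ terms, the result agrees with the direct sums over the first N terms while needing only μ multiplications each.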

27.12 The graph of S as ordinate against $\mu$ as abscissa gives us a
periodogram and the whole process of analysis is known as periodogram analysis.

Example 27.3

Perhaps the most famous (and certainly the most exhaustive) example
of a periodogram analysis is the one carried out by Lord (then Sir William)
Beveridge on a series of index-numbers of wheat prices constructed by
him for a period of about 300 years. Figure 27.1 shows the resulting
periodogram.

Beveridge worked out the intensities for many trial periods which are
not integral. The method is the same in essence as that of 27.11. For
instance, if $\mu = 10/3$ we write down the series in rows of 10 and multiply
the sums $m_1, m_2, \ldots$ by $\cos\frac{2\pi}{\mu}, \cos\frac{4\pi}{\mu}, \ldots$ etc., in forming A. There
were, in fact, many more trial values for lower values of $\mu$ than we have
been able to show on the diagram.

The interpretation of a periodogram like this is very difficult. Beveridge
himself was inclined to attribute significance to 18 or 19 major peaks, and
was only following the practice of the physical sciences in doing so. It has,
however, subsequently been shown that three-quarters of the peaks are
explainable as sampling effects. In fact, it may be shown that if v is
the variance of the series the chance that $S^2$ exceeds $4v\kappa/n$ in value is
$e^{-\kappa}$, and hence if q trial periods are picked out at random the chance that one
at least should exceed $4v\kappa/n$ is

    \[ 1 - (1 - e^{-\kappa})^q \]

On the basis of this criterion, the peak at $\mu = 15.25$ is significant
and possibly those at $\mu = 5.1$, 12.8, 17.3 and 20.0 are significant, but
no more. More recent researches on the periodogram for an autoregressive
series (27.13 below) indicate that it may be smoothed and on this basis
the peaks at 5.1 and 15.25 alone would be significant. But we shall
have to make these statements without proof and, indeed, without adequate
discussion, merely to warn the student to mistrust most of what he finds
in the literature on the periodograms of time-series. Different writers
have been led to claim the existence of cycles of all kinds in economic
and meteorological data. A reconsideration of the data would probably
show that none of these cycles exists in the sense of being strictly periodic,
at least in economics.*
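
To see how quickly the chance of a spurious peak grows with the number of trial periods, the criterion can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and the chance that a single S² exceeds 4vκ/n is taken to be exp(-κ), which is the form consistent with the expression 1 - (1 - e^{-κ})^q given above.

```python
from math import exp, log


def chance_at_least_one(kappa, q):
    """Chance that at least one of q trial periods gives S^2 > 4*v*kappa/n."""
    return 1 - (1 - exp(-kappa)) ** q


def kappa_for_level(level, q):
    """Value of kappa for which chance_at_least_one(kappa, q) equals level."""
    return -log(1 - (1 - level) ** (1 / q))


def intensity_threshold(variance, n, kappa):
    """The critical value 4*v*kappa/n for S^2."""
    return 4 * variance * kappa / n


for q in (1, 10, 100):
    print(q, round(kappa_for_level(0.05, q), 3))
```

The required κ rises with q, so a peak that looks striking when a single period is tested may be unremarkable when it is the largest of a hundred.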

* For some further discussion and tables to facilitate the performance of a periodogram
analysis see Kendall, Contributions to the Study of Oscillatory Time-Series, 1946,
Cambridge University Press.

Autoregressive series

27.13 A more modern approach to the subject attempts to take into
account the point we noted in 27.9, namely that when a disturbance
occurs it is integrated into the motion of the system. Instead of regarding
our system as oscillating like a pendulum (the only departure from
harmonic motion then being in the errors of observation) we shall consider
it as swinging like a pendulum subjected to a continual stream of shocks,
as for instance if it were pelted by small boys at random with peas. The
pendulum will continue to swing backwards and forwards, but not
regularly so. The times between its swings will not be constant nor will
it always swing out to the same extent. In fact it will behave very much
as many oscillatory time-series are seen to behave, which is our main
justification for introducing this model for study.

27.14 We shall suppose that the motion of the system is determined
by two factors: (a) a group of internal properties such as elasticities and
constraints which determine how the system moves if left to itself and
(b) a series of external shocks. We shall further suppose that the existence
of factors in the first group can be expressed by saying that the value of
the series at time t is a linear expression in values at previous points of
time. We shall then have equations such as

    \[ u_{t+1} = \mu u_t + \epsilon_{t+1} \tag{27.16} \]

where $\mu$ is a constant and $\epsilon$ represents the external disturbance; and

    \[ u_{t+2} = -\alpha u_{t+1} - \beta u_t + \epsilon_{t+2} \tag{27.17} \]

where again $\alpha$ and $\beta$ are constants. Such series are said to be
autoregressive because (27.16) and (27.17) may be regarded as regression
equations of one term of the series on previous terms. More elaborate
systems can, of course, be devised but these two simple cases are all we
shall consider.

Fig. 27.3. Graph of the values of Table 27.3


TABLE 27.2. Values of series $u_{t+1} = 0.7 u_t + \epsilon_{t+1}$

where $\epsilon_{t+1}$ is a random normal variable with zero mean.
From Kendall, 1949, Biometrika, 36, 267.

    Number     Value of       Number     Value of
    of term    series         of term    series

       1        2.390           21        0.546
       2        0.985           22       -0.886
       3       -0.655           23       -1.321
       4       -0.679           24       -1.014
       5       -0.044           25       -2.254
       6       -1.457           26        0.582
       7       -0.731           27        0.272
       8       -0.724           28        0.358
       9       -1.567           29        0.981
      10       -1.654           30       -0.497
      11       -2.416           31       -1.078
      12       -2.821           32       -0.318
      13       -0.701           33       -0.597
      14       -1.515           34        1.697
      15       -2.112           35        2.585
      16       -1.602           36        0.170
      17       -1.805           37        0.497
      18       -1.624           38        0.437
      19       -1.060           39        1.S54
      20       -0.022           40        1.474

Example 27.4

To show how series of this kind behave, we give in Table 27.2 and Figure
27.2 the graph of a series of type (27.16) with $\mu = 0.7$, the values of $\epsilon$ being
random numbers chosen from a normal population.

In Table 27.3 and Figure 27.3 we show similarly the graph of a series
of type (27.17) with $\alpha = -1.1$, $\beta = 0.5$ where $\epsilon$ is a random variable chosen
by selecting random numbers from the range -9.5 to +9.5.

The irregular occurrence of peaks and troughs in such data is quite
clear from the diagrams.
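
Series of this kind are easy to generate. The following Python sketch is an illustration under stated assumptions, not a reconstruction of how the published tables were produced: the function names, starting values and random seed are arbitrary, and the rectangular shocks are approximated by integers drawn uniformly from -9 to 9.

```python
import random


def first_order_series(mu, shocks, start=0.0):
    """Generate u_{t+1} = mu * u_t + e_{t+1} from a list of shocks."""
    values = []
    u = start
    for e in shocks:
        u = mu * u + e
        values.append(u)
    return values


def second_order_series(alpha, beta, shocks, start=(0.0, 0.0)):
    """Generate u_{t+2} = -alpha * u_{t+1} - beta * u_t + e_{t+2}."""
    prev2, prev1 = start
    values = []
    for e in shocks:
        u = -alpha * prev1 - beta * prev2 + e
        values.append(u)
        prev2, prev1 = prev1, u
    return values


rng = random.Random(1)  # arbitrary seed, for repeatability only
normal_shocks = [rng.gauss(0.0, 1.0) for _ in range(40)]
uniform_shocks = [rng.randint(-9, 9) for _ in range(65)]

series_a = first_order_series(0.7, normal_shocks)
series_b = second_order_series(-1.1, 0.5, uniform_shocks)
print([round(u, 3) for u in series_a[:5]])
print([round(u) for u in series_b[:5]])
```

Plotting either series shows peaks and troughs recurring at irregular intervals, although no periodic term has been put in.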


TABLE 27.3. Values of series $u_{t+2} = 1.1 u_{t+1} - 0.5 u_t + \epsilon_{t+2}$

where $\epsilon_{t+2}$ is a rectangular random variable with range -9.5 to 9.5, rounded off to
nearest unit.
From Kendall, 1944, Biometrika, 33, 105.

    Number     Value of      Number     Value of      Number     Value of
    of term    series        of term    series        of term    series

       1                       23         - 4           45         -13
       2                       24         - 5           46           1
       3                       25         - 9           47           6
       4                       26         - 4           48           4
       5                       27         - 4           49           H
       6                       28           3           50          15
       7                       29           9           51           9
       8         - 1           30           4           52           8
       9          10           31         - 8           53           4
      10          10           32         - 6           54         - 1
      11                       33         - 3           55
      12                       34         - 2           56
      13                       35           0           57
      14                       36         - 1           58
      15                       37         - 3           59
      16                       38           3           60
      17                       39         - 1           61         - 5
      18                       40         - 8           62         -11
      19                       41         - 3           63         - 8
      20                       42         - 8           64         - 3
      21           1           43         -10           65           5
      22         - 5           44         -16



27.15 Consider now the series of (27.16) in the form

    \[ u_{t+1} - \mu u_t = \epsilon_{t+1} \tag{27.18} \]

where $\epsilon$ has zero mean (and hence so has u) and successive values of $\epsilon$
are independent. It will be clear from the series that $u_t$ involves $\epsilon_t$, $\epsilon_{t-1}$,
$\epsilon_{t-2}$, etc., but not $\epsilon_{t+1}$, etc. Let us then multiply (27.18) by $u_{t-k}$ and
sum over all values of u. Since $\operatorname{cov}(u_{t+1}, u_{t-k}) = \rho_{k+1} \operatorname{var} u$, where
$\rho_{k+1}$ is the (k + 1)th autocorrelation, we have

    \[ (\rho_{k+1} - \mu\rho_k) \operatorname{var} u = \operatorname{cov}(\epsilon_{t+1}, u_{t-k}) \]

and since the covariance on the right vanishes for k > -1 we have

    \[ \rho_{k+1} - \mu\rho_k = 0, \qquad k > -1 \tag{27.19} \]

In particular when k = 0

    \[ \rho_1 = \mu \tag{27.20} \]

and hence

    \[ \rho_k = \mu^k = \rho_1^k \tag{27.21} \]

We may note from (27.20) that only values of $\mu$ not greater than unity
are admissible. If $\mu$ were greater than one the series would increase in
amplitude and "explode" to infinity.

27.16 In a like manner, for the series of (27.17)

    \[ u_{t+2} + \alpha u_{t+1} + \beta u_t = \epsilon_{t+2} \]

we have, on multiplying by $u_{t-k}$ and summing over u,

    \[ \rho_{k+2} + \alpha\rho_{k+1} + \beta\rho_k = 0, \qquad k > -2 \tag{27.22} \]

In particular, for k = -1, k = 0 we have

    \[ \rho_1 + \alpha + \beta\rho_1 = 0 \]
    \[ \rho_2 + \alpha\rho_1 + \beta = 0 \]

leading to

    \[ \alpha = -\frac{\rho_1 (1 - \rho_2)}{1 - \rho_1^2} \tag{27.23} \]

    \[ \beta = \frac{\rho_1^2 - \rho_2}{1 - \rho_1^2} \tag{27.24} \]

It may be shown by the theory of finite difference equations (we omit
the proof) that the solution of (27.22) is

    \[ \rho_k = \frac{p^k \sin(k\theta + \psi)}{\sin\psi} \tag{27.25} \]

where

    \[ p = \sqrt{\beta} \ \text{with positive sign}, \qquad
       \cos\theta = -\frac{\alpha}{2\sqrt{\beta}}, \qquad
       \tan\psi = \frac{1 + p^2}{1 - p^2} \tan\theta \tag{27.26} \]

Here again there are restrictions on the constants $\alpha$ and $\beta$. The latter
must be positive for p to be real and since $\cos\theta$ is not greater than unity
$\alpha^2 < 4\beta$. Further, since $p^k$ cannot exceed unity p cannot do so. Hence
$\beta$ must be positive and not greater than unity and $\alpha$ must be not greater
than 2 in absolute value. If these conditions are not obeyed the series
will not oscillate within bounds but will diverge to unlimited values.
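
The agreement between the difference equation (27.22) and the damped-harmonic solution can be checked numerically. The sketch below is illustrative: the function names are hypothetical, the closed form is taken as p^k sin(kθ + ψ)/sin ψ with p = √β, cos θ = -α/(2√β) and tan ψ = {(1 + p²)/(1 - p²)} tan θ, and the constants are assumed to satisfy α² < 4β and 0 < β < 1.

```python
from math import sqrt, acos, atan, tan, sin


def correlogram_by_recursion(alpha, beta, max_lag):
    """rho_0 .. rho_max_lag from rho_{k+2} + alpha*rho_{k+1} + beta*rho_k = 0."""
    rho = [1.0, -alpha / (1 + beta)]       # rho_1 follows from the case k = -1
    for k in range(2, max_lag + 1):
        rho.append(-alpha * rho[k - 1] - beta * rho[k - 2])
    return rho[:max_lag + 1]


def correlogram_closed_form(alpha, beta, max_lag):
    """rho_k as a damped harmonic; needs alpha^2 < 4*beta and 0 < beta < 1."""
    p = sqrt(beta)
    theta = acos(-alpha / (2 * p))
    psi = atan((1 + p * p) / (1 - p * p) * tan(theta))
    return [p ** k * sin(k * theta + psi) / sin(psi)
            for k in range(max_lag + 1)]


recursive = correlogram_by_recursion(-1.1, 0.5, 10)
closed = correlogram_closed_form(-1.1, 0.5, 10)
for k, (x, y) in enumerate(zip(recursive, closed)):
    print(k, round(x, 4), round(y, 4))
```

The two columns should agree to rounding error, and the values oscillate in sign while shrinking by the factor p at each step.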



27.17 The results of 27.15 and 27.16 serve two main purposes. If we
know that the series are of the linear autoregressive type, (27.20), (27.23)
and (27.24), and similar equations for more complicated series, enable
us to estimate the constants $\mu$, $\alpha$ and $\beta$ in terms of the autocorrelations
which, for large samples at least, we may take to be the observed serial
correlations. Secondly, the laws obeyed by successive autocorrelations
as exemplified in (27.21) and (27.25) enable us to judge whether given series
are of the autoregressive type.
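
The first purpose, estimation of the constants from the serial correlations, takes only a few lines. In the sketch below the function names are hypothetical, the formulae used are α = -ρ₁(1 - ρ₂)/(1 - ρ₁²) and β = (ρ₁² - ρ₂)/(1 - ρ₁²), which follow from (27.22) with k = -1 and k = 0, and the observed serial correlations are substituted for the autocorrelations.

```python
def first_order_estimate(r1):
    """Estimate of mu in u_{t+1} = mu*u_t + e, namely r_1 itself."""
    return r1


def second_order_estimates(r1, r2):
    """Estimates (a, b) of alpha and beta from the first two serial correlations."""
    denom = 1 - r1 * r1
    a = -r1 * (1 - r2) / denom
    b = (r1 * r1 - r2) / denom
    return a, b


# Illustrative values of the first two serial correlations.
a, b = second_order_estimates(0.595, -0.151)
print(round(a, 3), round(b, 3))
```

Estimates obtained in this way are only as reliable as the serial correlations themselves, so for short series they should be treated with caution.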


The correlogram

27.18 The graph of the autocorrelation $\rho_k$ as ordinate against k for
abscissa is called a correlogram. Since $\rho_{-k} = \rho_k$ we draw it only for
non-negative values of k. Table 27.4 and Figure 27.4 give the serial
correlations and the correlogram of the sheep data of Table 27.1. There is a
marked oscillatory movement which may be compared with Figure 27.5,
giving the correlogram of the artificial series of Table 27.3.

27.19 Equation (27.21) shows that the theoretical correlogram of a
series of the autoregressive type (27.18) will be a simple curve decaying
from unity at k = 0 to zero at k = $\infty$, the ordinate at each point k being
$\rho_1$ times the ordinate at the previous point. On the other hand equation
(27.25) shows that the theoretical correlogram of the series (27.17) will
not only decay according to the factor p but will also oscillate. This
so-called damped harmonic is illustrated in Figure 27.6.

These theoretical forms, however, are reproduced only approximately
by series of finite length, as Figure 27.5 illustrates. The correlogram
oscillates and its earlier terms damp out, but there comes a point where
no further damping appears. This failure to damp must be regarded as
a sampling effect.
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TABLE 27.5.— Serial OQrrdatloiu of the arUflclal series of TaUe 27.3 


Order of 
correlation 
k 


h 


k 


1 

0*70 

11 

-0*05 

21 

005 

2 

0-29 

12 

-0*17 

22 

-0*12 

3 

001 

13 

-0*27 

23 

-0*28 

4 

-017 

14 

-0*31 

24 

-0*43 

5 

-0-27 

15 

-0*30 

25 

-0-57 

6 

-0-25 

16 

-0*18 

26 

-0*56 

7 

-013 

17 

0*12 

27 

-0*26 

8 

0-07 

18 

0*29 

28 

0*02 

9 

012 

19 

0*33 

29 

0*17 

10 

0*05 

20 

0*22 

30 

0*27 



Fig. 27.5. Correlogram of the artificial series of Table 27.3

27.20 Let us return to the scheme of harmonics represented by (27.7).
It may be shown that for the series

\[ u_t = \sum_j A_j \sin\left(\frac{2\pi t}{\lambda_j}\right) + \varepsilon_t \]

the correlogram is given by

\[ \rho_k = \frac{\sum_j A_j^2 \cos\left(\frac{2\pi k}{\lambda_j}\right)}{\sum_j A_j^2 + 2\,\mathrm{var}\,\varepsilon} \qquad (27.27) \]

provided that $\varepsilon$ is independent of the harmonic terms.

Thus to any term with amplitude $A_j$ in the original series there corresponds
a wave of amplitude $A_j^2/(\sum_i A_i^2 + 2\,\mathrm{var}\,\varepsilon)$ in the correlogram which
is undamped.

Theoretically, then, the correlogram should give us a method of dis- 
criminating between the scheme of superposed harmonics and the auto- 
regressive scheme. In one case the oscillations in the correlogram do not 
damp out, in the other case they do. In practice, for short series, the 
discriminating power of the correlogram is not very high, owing to the 
failure of autoregressive correlograms to damp out for sampling reasons. 
Nevertheless an examination of the correlogram is often a very good way 
to start an investigation into the generating model of a given system. 
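The undamped behaviour for a harmonic scheme can be checked by a small experiment. The sketch below is illustrative only: it assumes a single sine term of amplitude $A$ and wavelength $\lambda$ plus independent normal disturbances, for which the correlogram is read here as $A^2\cos(2\pi k/\lambda)/(A^2 + 2\,\mathrm{var}\,\varepsilon)$; the series length, the seed and all names are hypothetical choices.

```python
import math
import random


def serial_correlation(u, k):
    """Serial correlation of order k, using the overall mean and variance."""
    n = len(u)
    mean = sum(u) / n
    var = sum((x - mean) ** 2 for x in u) / n
    cov = sum((u[t] - mean) * (u[t + k] - mean) for t in range(n - k)) / (n - k)
    return cov / var


def harmonic_series(n, amplitude, wavelength, noise_sd, seed=0):
    """A single sine term plus independent normal disturbances (assumed form)."""
    rng = random.Random(seed)
    return [
        amplitude * math.sin(2 * math.pi * t / wavelength) + rng.gauss(0.0, noise_sd)
        for t in range(n)
    ]


def theoretical_rho(k, amplitude, wavelength, noise_sd):
    """One-term case of the undamped correlogram described in the text."""
    a2 = amplitude ** 2
    return a2 * math.cos(2 * math.pi * k / wavelength) / (a2 + 2 * noise_sd ** 2)


if __name__ == "__main__":
    u = harmonic_series(20000, amplitude=2.0, wavelength=10, noise_sd=1.0)
    for k in (10, 50, 100, 200):
        print(k, round(serial_correlation(u, k), 3),
              round(theoretical_rho(k, 2.0, 10, 1.0), 3))
```

At lags that are whole multiples of the wavelength the observed values should stay near the same height however large the lag, in contrast with an autoregressive series.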

Example 27.5

Consider again the data of Table 27.4. Taking the observed serial
correlations as the parent values we have

\[ r_1 = 0.595, \qquad r_2 = -0.151. \]

Hence, from (27.23) $a$ (the estimate of $\alpha$) $= -1.060$
and from (27.24) $b$ (the estimate of $\beta$) $= +0.782$.

If the series can be represented by the three-term linear autoregressive
scheme then that scheme is

\[ u_{t+2} - 1.060\,u_{t+1} + 0.782\,u_t = \varepsilon_{t+2}. \]

It is natural to wonder whether a three-term scheme is adequate and
whether more terms may not be required. The question may be answered
by the calculation of partial correlations. The following are the partials
of the present series in our usual notation, 13.2, for instance, denoting
the correlation between $u_t$ and $u_{t+2}$ when $u_{t+1}$ is constant.

Order of partial      Value of      $\prod(1 - r^2)$
correlation           partial
12                     0.595         0.6460
13.2                  -0.782         0.2509
14.23                  0.097         0.2485
15.234                -0.183         0.2402
16.2345                0.031         0.2400

The product $1 - R^2$ in the last column measures (12.20) the closeness
of the representation of the series and it is clear that little extra accuracy
is gained by taking more than three terms, which will account for 75
per cent of the variation.
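The arithmetic of this example is easily reproduced. Equations (27.23) and (27.24) are not restated in this part of the chapter, so the sketch below assumes that they are the solution of the relations $\rho_1 = -\alpha/(1+\beta)$ and $\rho_2 = -\alpha\rho_1 - \beta$ for the three-term scheme; the function names are hypothetical.

```python
def fit_three_term(r1, r2):
    """Estimates (a, b) of (alpha, beta) from the first two serial correlations,
    assuming rho1 = -alpha/(1+beta) and rho2 = -alpha*rho1 - beta."""
    denom = 1.0 - r1 ** 2
    a = -r1 * (1.0 - r2) / denom
    b = (r1 ** 2 - r2) / denom
    return a, b


def running_products(partials):
    """Cumulative products of (1 - r^2) over successive partial correlations."""
    out, prod = [], 1.0
    for r in partials:
        prod *= 1.0 - r * r
        out.append(prod)
    return out


if __name__ == "__main__":
    a, b = fit_three_term(0.595, -0.151)
    print(round(a, 3), round(b, 3))
    for value in running_products([0.595, -0.782, 0.097, -0.183, 0.031]):
        print(round(value, 4))
```

Because the partials are entered to three decimals only, the later products may differ from the tabulated ones by a unit in the fourth place.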


27.21 It may be added that for the purposes of detecting oscillatory
movements by correlogram analysis "shortness" is a relative term. Even
series of 400 terms are sometimes "short" in the sense that the correlogram
after the tenth serial correlation or so does not damp out after the manner
of Figure 27.6. A consideration of the magnitude of the variance of serial
correlations in a random series, $1/(n - k)$, will show why this is so; for
$n - k$ of the order of 100 the standard deviation is 0.1 and values of $r$ as
great as 0.3 are not impossible. What does appear to be true in practice
is that even if the amplitude of the oscillations does not decay quickly,
the swings in the correlogram conform to the period of the generating
scheme as in Figure 27.4.

27.22 We conclude the chapter with a brief account of some of the
properties of the autoregressive schemes of (27.16) and (27.17). Let us
note as a preliminary point that such schemes will always give an
approximate representation of the series in the sense that a regression
line will always approximately represent the data to which it is fitted.

From relations such as

\[ u_t = \mu u_{t-1} + \varepsilon_t = \varepsilon_t + \mu\varepsilon_{t-1} + \mu^2\varepsilon_{t-2} + \ldots \qquad (27.28) \]

we see that the series may be regarded as a moving average of infinite
extent of the series of $\varepsilon$'s.* The weights decrease and the contribution
to $u_t$ of $\varepsilon_{t-k}$ is proportional to $\mu^k$, that is, the contribution of the past
is less and less important as it becomes more distant, which is what we
should expect. We have directly from (27.28), when the $\varepsilon$'s are independent,

\[ \mathrm{var}\,u_t = (1 + \mu^2 + \mu^4 + \ldots)\,\mathrm{var}\,\varepsilon = \frac{\mathrm{var}\,\varepsilon}{1 - \mu^2}, \qquad (27.29) \]

expressing the variance of the series in terms of that of the disturbance
function $\varepsilon$. If $\mu$ is near unity the variance of $u$ may be much larger than
that of $\varepsilon$.
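The passage from the moving-average weights to the variance ratio can be verified numerically. The sketch below is illustrative only: it assumes the weights of the disturbances are the successive powers of $\mu$ with $|\mu| < 1$, sums their squares over a long but finite stretch, and compares the total with $1/(1 - \mu^2)$. The names and the truncation length are hypothetical.

```python
def moving_average_weights(mu, terms):
    """Weights mu**j, j = 0..terms-1, attached to e[t-j] (assumed form)."""
    return [mu ** j for j in range(terms)]


def variance_ratio_from_weights(mu, terms=10_000):
    """var u / var e as the sum of squared weights (independent disturbances)."""
    return sum(w * w for w in moving_average_weights(mu, terms))


def variance_ratio_closed_form(mu):
    """var u / var e = 1 / (1 - mu^2), valid only for |mu| < 1."""
    if abs(mu) >= 1:
        raise ValueError("the series is not damped unless |mu| < 1")
    return 1.0 / (1.0 - mu * mu)


if __name__ == "__main__":
    for mu in (0.3, 0.6, 0.9, 0.99):
        print(mu, variance_ratio_from_weights(mu), variance_ratio_closed_form(mu))
```

The last line of output shows how quickly the ratio grows as $\mu$ approaches unity.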

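27.23 In a similar manner it may be shown that for the three-term
series (27.17) the solution for $u_t$, apart from terms which will have damped
out of existence if the series was begun a long time ago, is also a moving
average of the $\varepsilon$'s and is given by

\[ u_t = \sum_{j=0}^{\infty} \frac{p^j \sin (j+1)\theta}{\sin\theta}\,\varepsilon_{t-j}, \qquad p = \sqrt{\beta}, \quad \cos\theta = -\frac{\alpha}{2\sqrt{\beta}}. \qquad (27.30) \]

These weights are themselves oscillating and damped, like the correlogram.
It may also be shown that

\[ \frac{\mathrm{var}\,u}{\mathrm{var}\,\varepsilon} = \frac{1 + \beta}{(1 - \beta)\{(1 + \beta)^2 - \alpha^2\}}, \qquad (27.32) \]

which reduces to (27.29) when $\alpha = -\mu$, $\beta = 0$ as it should.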

Example 27.6

In Example 27.5 we found for estimates of $\alpha$ and $\beta$ the values $-1.060$
and $+0.782$ respectively. Substitution in (27.32) gives

\[ \mathrm{var}\,u = 3.98\,\mathrm{var}\,\varepsilon. \]

Thus of the total variation of the series var $\varepsilon$ represents about 1/3.98
or 25 per cent, which agrees with the estimate given by $1 - R^2$ in Example
27.5.
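The substitution can be checked directly. The sketch below evaluates (27.32) for given constants and, as a cross-check, uses the identity $\mathrm{var}\,\varepsilon/\mathrm{var}\,u = (1 - \beta^2)(1 - \rho_1^2)$ with $\rho_1 = -\alpha/(1+\beta)$, which follows from (27.32) if the first autocorrelation of the three-term scheme has that form (an assumption here, since the relevant equations lie outside this part of the chapter). The function names are hypothetical.

```python
def variance_ratio_three_term(alpha, beta):
    """var u / var e for the three-term scheme, as given by (27.32)."""
    return (1.0 + beta) / ((1.0 - beta) * ((1.0 + beta) ** 2 - alpha ** 2))


def residual_fraction(alpha, beta):
    """var e / var u written as (1 - beta^2)(1 - rho1^2), rho1 = -alpha/(1+beta)."""
    rho1 = -alpha / (1.0 + beta)
    return (1.0 - beta ** 2) * (1.0 - rho1 ** 2)


if __name__ == "__main__":
    ratio = variance_ratio_three_term(-1.060, 0.782)
    print(ratio, 1.0 / ratio, residual_fraction(-1.060, 0.782))
```

The two expressions for the residual fraction should coincide, and with the rounded estimates of Example 27.5 they should reproduce the product $\prod(1 - r^2)$ found there apart from rounding.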


*The series of (27.28), to be a complete solution, should have added to it a term
$A\mu^t$ where $A$ is an arbitrary constant. We suppose, however, that the series began a
long time ago so that this term has damped out of existence, $\mu$ being less than unity.

Example 27.7

The sunspot data of Table 26.4 are an extract from a larger series
beginning in 1749. An analysis of the series of 176 terms ending in 1924
(Yule, Phil. Trans., A, 226, 267) gave the following:

\[ a = -1.342, \qquad b = +0.655. \]

Partial correlations indicated that this series was adequately represented
(about 80 per cent) by a three-term autoregressive scheme and no improvement
would be given by further terms. Thus it appears that the
series can be regarded as autoregressive with a damping factor $\sqrt{0.655} = 0.81$
approximately. The period in the correlogram ($\theta$ of equation
(27.26)) is given by

\[ \cos\theta = \frac{1.342}{2\sqrt{0.655}} = 0.829, \]

giving $\theta = 34^\circ$ approximately. Thus the period of the correlogram is
360/34 = 10.6 years. The series itself has no single "period" because
the interval between successive peaks and troughs varies.
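The damping factor and the period of the correlogram follow from the two fitted constants alone. The sketch below is a minimal illustration, assuming the damping factor is $\sqrt{b}$ and the angle is defined by $\cos\theta = -a/(2\sqrt{b})$ as in the example; the function names are hypothetical.

```python
import math


def damping_factor(b):
    """Damping factor sqrt(b) of the three-term scheme."""
    return math.sqrt(b)


def autoregressive_period(a, b):
    """Period 2*pi/theta of the correlogram, where cos(theta) = -a / (2*sqrt(b))."""
    cos_theta = -a / (2.0 * math.sqrt(b))
    if abs(cos_theta) >= 1.0:
        raise ValueError("no oscillation: a*a must be less than 4*b")
    return 2.0 * math.pi / math.acos(cos_theta)


if __name__ == "__main__":
    a, b = -1.342, 0.655  # constants quoted in the example
    print(damping_factor(b), autoregressive_period(a, b))
```

The guard on $|\cos\theta|$ reflects the fact that the correlogram oscillates only when $a^2 < 4b$.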


The "period" of an oscillation

27.24 From what we have said above it will be clear that for autoregressive
schemes we cannot speak of the period of the series. There
will be one period in the correlogram for the three-term case of (27.17),
or more with more elaborate schemes, and perhaps we might call this
the autoregressive period. But it does not necessarily correspond to
the mean-distance between peaks in the series itself and in any case the
distances between peaks vary. The same is true of the distances between
"upcrosses" or "downcrosses", namely points where the series (measured
from its mean) changes sign from negative to positive or vice-versa.

The autoregressive period of (27.17) is given by $2\pi/\theta$ where as in (27.26)

\[ \cos\theta = -\alpha/(2\sqrt{\beta}). \qquad (27.33) \]

Now consider the series of values

\[ x_t = u_{t+1} - u_t, \qquad y_t = u_{t+1} - u_{t+2}. \qquad (27.34) \]

We have, since the mean values of $x$ and $y$ are zero,

\[ \mathrm{var}\,x_t = \mathrm{var}\,u_{t+1} + \mathrm{var}\,u_t - 2\,\mathrm{cov}(u_{t+1}, u_t) = 2\,\mathrm{var}\,u\,(1 - \rho_1) = \mathrm{var}\,y_t, \]

and

\[ \mathrm{cov}(x_t, y_t) = \mathrm{cov}(u_t, u_{t+2}) + \mathrm{var}\,u_{t+1} - \mathrm{cov}(u_{t+1}, u_{t+2}) - \mathrm{cov}(u_{t+1}, u_t) = \mathrm{var}\,u\,(1 - 2\rho_1 + \rho_2). \]

Hence for $\tau$, say, the correlation between $x$ and $y$ we have

\[ \tau = \frac{1 - 2\rho_1 + \rho_2}{2(1 - \rho_1)}. \qquad (27.35) \]

Now suppose that $x$ and $y$ are normally distributed, as will be the case
if $u$ is normal. The relative frequency with which $x$ and $y$ are positive
(i.e. $u_{t+1} > u_t$ and $u_{t+1} > u_{t+2}$, so that $u_{t+1}$ is a peak) is then the relative
frequency in a bivariate normal distribution in the positive cell among
the four into which it is divided by $x = 0$ and $y = 0$. This, by Sheppard's
theorem (Exercise 10.4), is given by $f$ where

\[ \tau = \cos(1 - 2f)\pi = -\cos 2\pi f \]

so that

\[ 1/f = 2\pi/\cos^{-1}(-\tau) \qquad (27.36) \]

and this gives us the mean distance between peaks.
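The chain from the first two autocorrelations to the mean distance between peaks is short enough to set out as a calculation. The sketch below assumes a normal series and follows the steps of the text: the correlation between the two differences, then the inverse cosine. The function names are hypothetical.

```python
import math


def peak_correlation(rho1, rho2):
    """Correlation between u[t+1]-u[t] and u[t+1]-u[t+2]."""
    return (1.0 - 2.0 * rho1 + rho2) / (2.0 * (1.0 - rho1))


def mean_peak_distance(rho1, rho2):
    """Mean distance between peaks, 2*pi / arccos(-tau), for a normal series."""
    tau = peak_correlation(rho1, rho2)
    return 2.0 * math.pi / math.acos(-tau)


if __name__ == "__main__":
    # A random series has rho1 = rho2 = 0.
    print(mean_peak_distance(0.0, 0.0))
    # A positively correlated series: peaks are further apart.
    print(mean_peak_distance(0.5, 0.25))
```

For a random series the result is 3, in agreement with the fact that one term in three of such a series is, on the average, a peak.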


27.25 For the autoregressive scheme (27.17) we have in virtue of (27.23)
and (27.24)

\[ \tau = \tfrac{1}{2}(1 + \alpha - \beta) \qquad (27.37) \]

and thus

\[ \text{mean distance (peaks)} = \frac{2\pi}{\cos^{-1}\{-\tfrac{1}{2}(1 + \alpha - \beta)\}} \qquad (27.38) \]

which is not the same as (27.33).

Example 27.8

Consider a series for which $\alpha = -1.1$, $\beta = 0.5$. From (27.38) we find
for the mean distance between peaks

\[ \tau = \tfrac{1}{2}(1 - 1.1 - 0.5) = -0.3, \qquad \cos^{-1} 0.3 = 72.54^\circ, \qquad 1/f = 4.96. \]

In a series of 480 terms constructed according to this formula Kendall
(J. Roy. Statist. Soc., 1945, 108, 93) found an observed value of 5.05, in
excellent agreement.

On the other hand for the autoregressive period, from (27.33),

\[ \cos\theta = 1.1/(2\sqrt{0.5}) = 0.7778, \qquad \theta = 38.9^\circ, \]

giving for the autoregressive period 360/38.9 = 9.3 units.

27.26 Two final comments:

(a) We have emphasised that for certain types of oscillatory series
the idea of a single period or set of periods in the strict sense may be
inappropriate. The student who is interested in oscillatory movements
should accustom himself to think of the distribution of distances between
peaks or upcrosses as expressing its oscillatory behaviour, in the same
way that he thinks about a distribution of frequencies as characterising
a population;

(b) For series of such a type the existence of the random variable $\varepsilon$
means that there is a limit to the accuracy with which we can predict
the behaviour of the series. The autoregressive scheme will account,
at least approximately, for a certain amount of systematic movement
expressible in terms of the constants of the scheme; and hence, given
previous members of the series we can predict the next member except
for the random element. The latter, though we may estimate its
variance, is itself unpredictable and there is thus an essential element
of uncertainty in any forecast of the future.
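Both comments can be illustrated by constructing an artificial series. The sketch below is an experiment under stated assumptions, not a reproduction of any series in the text: it generates a three-term autoregressive series with $\alpha = -1.1$, $\beta = 0.5$ and independent unit normal disturbances, then tabulates the distribution of distances between successive peaks and compares its mean with $2\pi/\cos^{-1}\{-\tfrac{1}{2}(1+\alpha-\beta)\}$ and with the autoregressive period $2\pi/\theta$. The series length, seed, burn-in and all names are hypothetical choices.

```python
import math
import random
from collections import Counter


def simulate_three_term(alpha, beta, n, seed=1, burn_in=500):
    """u[t] = -alpha*u[t-1] - beta*u[t-2] + e[t], with e ~ N(0, 1)."""
    rng = random.Random(seed)
    prev2, prev1 = 0.0, 0.0
    out = []
    for t in range(n + burn_in):
        u = -alpha * prev1 - beta * prev2 + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the start-up stretch
            out.append(u)
        prev2, prev1 = prev1, u
    return out


def peak_distances(u):
    """Distances between successive peaks (terms exceeding both neighbours)."""
    peaks = [t for t in range(1, len(u) - 1) if u[t] > u[t - 1] and u[t] > u[t + 1]]
    return [b - a for a, b in zip(peaks, peaks[1:])]


def theoretical_peak_distance(alpha, beta):
    return 2.0 * math.pi / math.acos(-0.5 * (1.0 + alpha - beta))


def autoregressive_period(alpha, beta):
    return 2.0 * math.pi / math.acos(-alpha / (2.0 * math.sqrt(beta)))


if __name__ == "__main__":
    alpha, beta = -1.1, 0.5
    distances = peak_distances(simulate_three_term(alpha, beta, 200_000))
    print("observed mean distance   ", sum(distances) / len(distances))
    print("theoretical mean distance", theoretical_peak_distance(alpha, beta))
    print("autoregressive period    ", autoregressive_period(alpha, beta))
    for d, count in sorted(Counter(distances).items()):
        print(d, count)
```

The frequency table printed at the end is the distribution of distances referred to in comment (a); its spread shows why no single period describes the series.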


SUMMARY 

1. Randomness in an oscillatory time series may conveniently be
tested by ascertaining the number of turning points which, in a random
series of $n$ terms, has a mean value of $\tfrac{2}{3}(n - 2)$ and a variance of
$(16n - 29)/90$.

2. Alternatively, a test may be made of the first serial correlation which
has a variance of $1/(n - 1)$ in random series.

3. The coefficient of product-moment correlation between members of
a series $(k - 1)$ members apart is called the autocorrelation (for infinite
series) or the serial correlation (for observed series) of order $k$.

4. The graph of the serial correlation as ordinate against the order $k$
as abscissa is called the correlogram of the series.

5. For series which may be regarded as composed of a series of harmonic
terms, a technique known as periodogram analysis may be used to isolate
the periodic terms.

6. A series in which the value at any point is a function of values at
previous points plus a disturbance is said to be autoregressive; and if
the function is linear it is linearly autoregressive. The two most important
cases are

\[ u_{t+1} = \mu u_t + \varepsilon_{t+1} \]

\[ u_{t+2} + \alpha u_{t+1} + \beta u_t = \varepsilon_{t+2} \]

7. The correlogram offers a means of discriminating between the
harmonic series and the autoregressive series.

8. An autoregressive series has no period in the strict sense. The
mean-distance between peaks may be quite different from the period of
the correlogram.
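The turning-point test of item 1 lends itself to a short routine. The sketch below is illustrative only: it counts peaks and troughs, compares the count with the mean $\tfrac{2}{3}(n - 2)$ and variance $(16n - 29)/90$ quoted above, and returns a standardised deviate. Treating a term equal to a neighbour as no turning point is an assumption of this sketch, and the function names are hypothetical.

```python
import math
import random


def count_turning_points(series):
    """Number of terms that are strictly greater, or strictly less, than both neighbours."""
    count = 0
    for prev, cur, nxt in zip(series, series[1:], series[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            count += 1
    return count


def turning_point_test(series):
    """Observed count, expected mean and variance in a random series, and the deviate."""
    n = len(series)
    observed = count_turning_points(series)
    mean = 2.0 * (n - 2) / 3.0
    variance = (16.0 * n - 29.0) / 90.0
    deviate = (observed - mean) / math.sqrt(variance)
    return observed, mean, variance, deviate


if __name__ == "__main__":
    rng = random.Random(0)
    series = [rng.gauss(0.0, 1.0) for _ in range(200)]
    print(turning_point_test(series))
```

A deviate well outside the range of about plus or minus 2 would suggest that the series is not random.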




EXERCISES 

27.1 The following table shows the deviations from a moving nine-year 
average of potato yields in England and Wales for the years 1888-1935 
(units are -^-th ton) — 


Year   Yield      Year   Yield      Year   Yield      Year   Yield
1888    -6        1900    -7        1912   -15        1924    -1
  89    +2          01    +6          13   [?]          25    +2
  90    -4          02    -3          14   [?]          26    -9
  91    -3          03    -7          15   [?]          27    -3
  92    -1          04    +2          16   [?]          28    +9
  93    +6          05   [?]          17   [?]          29   +[?]
  94    -2          06    +1          18   [?]          30    +1
  95    +7          07    -7          19   [?]          31   [?]
  96    +3          08    +8          20   [?]          32    +1
  97    -6          09    +4          21    -9          33    +2
  98    +2          10    +3          22   +11          34    +5
  99     0          11    +4          23    -1          35    -4


Find the number of turning points and show that it does not differ 
significantly from what would be expected of a random series. 

27.2 From (27.35) derive an expression for the mean-distance between
peaks in a series of type (27.16) in the form

\[ 2\pi/\cos^{-1}\{\tfrac{1}{2}(\mu - 1)\} \]

Consider the case when $\mu = 0$.

27.3 In an autoregressive series of type (27.17) find the mean-distances
between peaks for the following values of $\alpha$ and $\beta$.

   $\alpha$      $\beta$
   -1.5      0.8
   -1.0      0.6
   -0.8      0.8

Find also the autoregressive periods.

27.4 Show that the $k$th autocorrelation of the first difference of a
series with autocorrelations $\rho_k$ is given by

\[ \frac{2\rho_k - \rho_{k+1} - \rho_{k-1}}{2(1 - \rho_1)} \]

27.5 In a series of type (27.17) the observed $r_1$ was 0.850 and the
observed $r_2$ was [?]. Estimate $\alpha$ and $\beta$.

27.6 Two series $u$ and $u'$ are added together so that a new series is formed
by $v_t = u_t + u'_t$. If $u$ and $u'$ are independent show that the $k$th autocorrelation
of $v$ is given by

\[ \frac{\rho_k\,\mathrm{var}\,u + \rho'_k\,\mathrm{var}\,u'}{\mathrm{var}\,u + \mathrm{var}\,u'} \]

where $\rho$ and $\rho'$ refer to the autocorrelations of $u$ and $u'$ respectively.

27.7 By considering the joint variation of $u_t$ and $u_{t+1}$ show that the
mean-distance between upcrosses in a series is $2\pi/\cos^{-1}\rho_1$ where $\rho_1$ is the
first autocorrelation. Find the expression in terms of $\alpha$ and $\beta$ for series
of type (27.17).

27.8 The following are the serial correlations of the Beveridge series 
referred to in Example 27.3. Draw the correlogram and compare any 
periods which it suggests to you with the results of that example. 


Order of correlation $k$ and serial correlation $r_k$

   k     r_k         k     r_k         k     r_k         k     r_k
   1    0.562       16    0.158       31    [?]         46   -0.036
   2    0.103       17    0.109       32    [?]         47   -0.013
   3    [?]         18    0.002       33    [?]         48    0.042
   4    [?]         19   -0.075       34    0.007       49    0.062
   5    [?]         20   -0.062       35    0.056       50    0.065
   6   -0.136       21   -0.021       36    0.010       51    0.050
   7   -0.211       22   -0.062       37   -0.004       52    0.009
   8   -0.261       23   -0.088       38   -0.015       53   -0.027
   9   -0.192       24   -0.084       39   -0.047       54   -0.053
  10   -0.070       25   -0.076       40   -0.047       55   -0.073
  11   -0.003       26   -0.091       41    0.008       56   -0.106
  12   -0.015       27   -0.052       42    0.034       57   -0.084
  13   -0.012       28   -0.032       43    0.065       58   -0.019
  14    0.047       29   -0.012       44    0.099       59    0.003
  15    0.101       30    0.059       45    0.009       60    0.010


27.9 For the autoregressive series of type (27.17) show that

[?]

and hence that $1 + \alpha + \beta$ is not negative.

Show that the variance of the mean of $n$ consecutive terms of the
series is

\[ \frac{\mathrm{var}\,u}{n}(1 + A) \]

where, for large $n$, $A$ is given by

\[ -\frac{2(\alpha + \beta + \beta^2)}{(1 + \beta)(1 + \alpha + \beta)} \]

Hence show that $A$ is negative if $-\alpha$ is less than $\beta$, and thus that in some
circumstances the mean of $n$ consecutive values can have a smaller variance
than the mean of $n$ values chosen at random.

27.10 A Spencer 21-point smoothing formula (26.20) is applied to a
random series. Find the autocorrelations of the resulting series and
sketch the correlogram.

27.11 In the autoregressive series of type (27.17) consider the case when
$\beta = 1$. Show that the series then becomes undamped and the correlogram
reduces to a simple harmonic.

APPENDIX TABLE 1

Ordinates of the Normal Curve

\[ y = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} \]

with First and Second Differences

  x       y        Δ¹(-)    Δ²         x       y        Δ¹(-)    Δ²
 0.0    0.39894     199    -392       2.5    0.01753     395     +79
 0.1    0.39695     591    -374       2.6    0.01358     316     +66
 0.2    0.39104     965    -347       2.7    0.01042     250     +53
 0.3    0.38139    1312    -308       2.8    0.00792     197     +45
 0.4    0.36827    1620    -265       2.9    0.00595     152     +36
 0.5    0.35207    1885    -212       3.0    0.00443     116     +27
 0.6    0.33322    2097    -159       3.1    0.00327      89     +23
 0.7    0.31225    2256    -104       3.2    0.00238      66     +17
 0.8    0.28969    2360     -52       3.3    0.00172      49     +13
 0.9    0.26609    2412       0       3.4    0.00123      36     +10
 1.0    0.24197    2412     +46       3.5    0.00087      26      +7
 1.1    0.21785    2366     +84       3.6    0.00061      19      +6
 1.2    0.19419    2282    +118       3.7    0.00042      13      +4
 1.3    0.17137    2164    +143       3.8    0.00029       9      +2
 1.4    0.14973    2021    +161       3.9    0.00020       7      +3
 1.5    0.12952    1860    +173       4.0    0.00013       4      +1
 1.6    0.11092    1687    +177       4.1    0.00009       3      +1
 1.7    0.09405    1510    +177       4.2    0.00006       2
 1.8    0.07895    1333    +170       4.3    0.00004       2
 1.9    0.06562    1163    +162       4.4    0.00002       -       -
 2.0    0.05399    1001    +150       4.5    0.00002
 2.1    0.04398     851    +137       4.6    0.00001
 2.2    0.03547     714    +120       4.7    0.00001
 2.3    0.02833     594    +108       4.8    0.00000
 2.4    0.02239     486     +91       4.9    [?]





Precision of Interpolation. Owing to the magnitude of the second differences,
simple interpolation near the beginning of the table may give an error up to 5 in the
fourth place; the use of second differences will bring this down to 1 or 2 in the last place,
third differences being small. Where third differences are greatest, in the neighbourhood
of $x = 0.6$, the error may be as large as 3 in the last place unless the third difference
is used.

APPENDIX TABLE 2

Areas under the normal curve (Probability function of the normal distribution)

The table shows the area of the curve lying to the left of
specified deviates $x$; e.g. the area corresponding to a deviate 1.86 (= 1.5 + 0.36)
is 0.9686.

Deviate   0.0+    0.5+    1.0+    1.5+    2.0+    2.5+     3.0+     3.5+
 0.00     5000    6915    8413    9332    9772    9²379    9²865    9³77
 0.01     5040    6950    8438    9345    9778    9²396    9²869    9³78
 0.02     5080    6985    8461    9357    9783    9²413    9²874    9³78
 0.03     5120    7019    8485    9370    9788    9²430    9²878    9³79
 0.04     5160    7054    8508    9382    9793    9²446    9²882    9³80
 0.05     5199    7088    8531    9394    9798    9²461    9²886    9³81
 0.06     5239    7123    8554    9406    9803    9²477    9²889    9³81
 0.07     5279    7157    8577    9418    9808    9²492    9²893    9³82
 0.08     5319    7190    8599    9429    9812    9²506    9²897    9³83
 0.09     5359    7224    8621    9441    9817    9²520    9²900    9³83
 0.10     5398    7257    8643    9452    9821    9²534    9³03     9³84
 0.11     5438    7291    8665    9463    9826    9²547    9³06     9³85
 0.12     5478    7324    8686    9474    9830    9²560    9³10     9³85
 0.13     5517    7357    8708    9484    9834    9²573    9³13     9³86
 0.14     5557    7389    8729    9495    9838    9²585    9³16     9³86
 0.15     5596    7422    8749    9505    9842    9²598    9³18     9³87
 0.16     5636    7454    8770    9515    9846    9²609    9³21     9³87
 0.17     5675    7486    8790    9525    9850    9²621    9³24     9³88
 0.18     5714    7517    8810    9535    9854    9²632    9³26     9³88
 0.19     5753    7549    8830    9545    9857    9²643    9³29     9³89
 0.20     5793    7580    8849    9554    9861    9²653    9³31     9³89
 0.21     5832    7611    8869    9564    9864    9²664    9³34     9³90
 0.22     5871    7642    8888    9573    9868    9²674    9³36     9³90
 0.23     5910    7673    8907    9582    9871    9²683    9³38     9⁴04
 0.24     5948    7704    8925    9591    9875    9²693    9³40     9⁴08
 0.25     5987    7734    8944    9599    9878    9²702    9³42     9⁴12
 0.26     6026    7764    8962    9608    9881    9²711    9³44     9⁴15
 0.27     6064    7794    8980    9616    9884    9²720    9³46     9⁴18
 0.28     6103    7823    8997    9625    9887    9²728    9³48     9⁴22
 0.29     6141    7852    9015    9633    9890    9²736    9³50     9⁴25
 0.30     6179    7881    9032    9641    9893    9²744    9³52     9⁴28
 0.31     6217    7910    9049    9649    9896    9²752    9³53     9⁴31
 0.32     6255    7939    9066    9656    9898    9²760    9³55     9⁴33
 0.33     6293    7967    9082    9664    9901    9²767    9³57     9⁴36
 0.34     6331    7995    9099    9671    9904    9²774    9³58     9⁴39
 0.35     6368    8023    9115    9678    9906    9²781    9³60     9⁴41
 0.36     6406    8051    9131    9686    9909    9²788    9³61     9⁴43
 0.37     6443    8078    9147    9693    9911    9²795    9³62     9⁴46
 0.38     6480    8106    9162    9699    9913    9²801    9³64     9⁴48
 0.39     6517    8133    9177    9706    9916    9²807    9³65     9⁴50
 0.40     6554    8159    9192    9713    9918    9²813    9³66     9⁴52
 0.41     6591    8186    9207    9719    9920    9²819    9³68     9⁴54
 0.42     6628    8212    9222    9726    9922    9²825    9³69     9⁴56
 0.43     6664    8238    9236    9732    9925    9²831    9³70     9⁴58
 0.44     6700    8264    9251    9738    9927    9²836    9³71     9⁴59
 0.45     6736    8289    9265    9744    9929    9²841    9³72     9⁴61
 0.46     6772    8315    9279    9750    9931    9²846    9³73     9⁴63
 0.47     6808    8340    9292    9756    9932    9²851    9³74     9⁴64
 0.48     6844    8365    9306    9761    9934    9²856    9³75     9⁴66
 0.49     6879    8389    9319    9767    9936    9²861    9³76     9⁴67

Note. Decimal points in the body of the table are omitted. Repeated 9's are
indicated by powers, e.g. 9³71 stands for 0.99971.
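Entries of this table can be checked against the error function. The sketch below is illustrative only: it computes the area to the left of a deviate from the standard library's `math.erf`, and decodes the shorthand for repeated 9's on the assumption that a power denotes the number of leading 9's after the decimal point. Both function names are hypothetical.

```python
import math


def normal_area(x):
    """Area of the unit normal curve lying to the left of the deviate x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expand_nines(count, tail):
    """Decode the shorthand: `count` nines followed by the digits of `tail`."""
    return float("0." + "9" * count + tail)


if __name__ == "__main__":
    # The example in the table heading: deviate 1.86 = 1.5 + 0.36.
    print(round(normal_area(1.86), 4))
    # An entry written with a power of 9: deviate 2.50.
    print(expand_nines(2, "379"), round(normal_area(2.50), 5))
```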



APPENDIX TABLE 3

Significance points of $\chi^2$

Reproduced from Table III of R. A. Fisher's Statistical Methods for Research Workers, Oliver & Boyd, Ltd., Edinburgh, by permission of the author and publishers

Note. For values of $\nu$ greater than 30 the quantity $\sqrt{2\chi^2}$ may be taken to be distributed normally about mean $\sqrt{2\nu - 1}$ with unit variance.

APPENDIX TABLE 4

The Proportion of the Area of the Curve of Unit Area lying to the Left of the Ordinate of Deviation $t$, for values of $t$ proceeding by intervals of 0.1 from 0 to 6, and for values of $\nu$ from 1 to 20.

(Condensed to three figures from the four-figure tables by "Student" in Metron, vol. 5, 1925, and published by permission of Metron and the late W. S. Gosset, who supplied a few corrections to the original tables)


jlHIII 

isi 

2 

3 

4 

5 

6 

7 

8 

9 

10 

■Si 

0-500 

0*500 

0*500 

0-500 , 

0-500 

0-500 

0*500 

0*500 

0-500 

0*500 

■SB 

*532 

•535 

•537 

•537 

•538 

•538 

538 

•539 

•539 

•539 

•2 

•563 

•570 

•573 

•574 

•575 

•576 

•576 

•577 

•577 

•577 

-3 

*593 

•604 

•608 

•610 j 

•612 

•613 

*614 

•614 

•614* 

•615 

•4 

*621 

•636 

•642 

•645 , 

•647 

*648* 

•649* 

•650 

•651 

•651 

•5 

*648 

•667 

•674 

•678 

•681 

•683 

•684 

•685 

•685* 

•686 

•6 

*672 

•695 

•705 

•710 

•713 

*715 

•716 

•717 

•718 

•719 

•7 

*694 

•722 

•733 

•739 

•742 

•745 

•747 

•748 

•749 

•750 

8 

*715 

•746 

•759 

•766 

*770 

•773 

•775 

•777 

•778 

•779 


•733 

•768 

•783 

•790» 

•795 

•799 

•801 

•803 

•804 

•80S 

mSm 

•750 

•789 

•804» 

•813 

•818 

•822 

•825 

•827 

•828 

•830 

Bn 

*765 

•807 

•824 

•833* 

•839 

•843 

•846 

•848 

•850 

*851 

mSm 

•779 

•823* 

•842 

•852 

•858 

•862 

•865 

•868 

•870 

•871 

■SB 

•791 

*838 

•858 

•868 

•875 

•879 

•883 

•885 

•887 

•889 

1-4 

•803 

•852 

•872 

•883 

•890 

•894* 

•898 

•900 

•902* 

•904 

1-5 

•813 

•864 

•885 

•896 

•903 

•908 

•911 

•914 

•916 

•918 

1-6 

*^2 

*875 

•896 

•908 

•915 

•920 

•923 

•926 

•928 

-930 

1-7 

•831 

•884 

•906 

•918 

•925 

-930 

•933* 

•936 

•938 

•940 

1*8 

•839 

•893 

•915 

•927 

•934 

•939 

•943 

•945 

•947 

•949 

■SB 

•846 

•901 

*923 

•935 

•942 

•947 

•950 

•953 

•955 

*957 

BAB 

•852 

•908 

*930 

•942 

•949 

•954 

•957 

•960 

•962 

•963 

21 

*858« 

-915 

•937 

•948 

•955 

•960 

•963 

•965* 

■967 

a QUO 
• 200X0 

2*2 

•864 

•921 

•942 

•954 

•960* 

•965 

•968 

•970* 

•972 

•974 

2-3 

*869* 

*926 

•947» 

•958* 

•965 

• QAO 
9020 

•972* 

•975 

•976* 

•978 

2*4 

•874 

*931 

*952 

•963 

• SfD«7 

•973 

•976 

•978 

•980 

•981 

2*5 

*879 

-935 

-956 

•967 

•973 

•977 

•979* 

•981* 

*983 

•984 

2*6 

•883 

*939 

•960 

•970 

*976 

•980 

•982 

•984 

•966 

•987 

2*7 

*887 

*943 

*963 

•973 

•979 

•982 

•985 

•986* 

•988 

•989 

2*8 

*891 

•946 

• OCA 

•976 

•981 

•984 

•987 

•988 

•990 

*991 

2*9 

■804 

•949 

• uov 

•978 

-983 

•986 

•988* 

•990 

•991 

•992 

3*0 

*898 

•952 

•971 

•980 

•985 

•988 

•990 

•991* 

•992* 

•993 

3*1 

•901 

*955 

•973 

•982 

•987 

•989 

•991 

•993 

•994 

0 QQA 
• 2f«0^ 

3*2 

*904 

•957 

•975 

•983* 

•988 

•991 

•992* 

•994 

•995 

•995 

3*3 

*906 

•960 

•977 

•985 

•989 

•992 

•993 

•995 

•995 

•mro 

3-4 

*909 

•962 

•979 

•986 

•990 

•993 

•994 

•995 

• S0C0O 

•997 

3*5 

*911 

4(UIA 

•980 

•988 

•991 

•994 

•995 

* 2090 

•997 

•997 

3*6 

*914 

*965 

•982 

•989 

•992 

•994 

*20200 

•20200^ 

•997 

•998 

3*7 

•916 

*967 

*983 

•990 

*993 

•995 

80200 

•997 

•997* 

•998 

3*8 

•918 

•969 

*984 

•990 

•994 

•995* 

•997 

*997 

20200 

•998 

3*9 

*920 

*970 

•985 

•991 

•994 

•9200 

*997 

•998 

•908 

•998* 

4*0 

•922 

*971 

•986 

•992 

•995 

• 20200 

•997 

•998 

•998 

.mm 

4*1 

•924 

*973 

•987 

*993 

•995 

•997 

998 

•998 

•900 

•999 

4*2 

•926 

•974 

•988 

•993 


•997 

•998 

•998* 

.mm 

•lf0F?r 

.999 

4*3 

*927 

•975 

•988 

•994 

•996 

•997* 

•998 

-099 

• WO 

.mm 

• sfW 

4*4 

•929 

*976 

*989 

•994 

• CWD 

•998 

•998 

•999 

•998 

•099 

4*5 

*930 

*wrj 

•990 

•995 

•997 

•Wo 


.nmi 

•mni 

•999 

•009 

4*6 

•932 

•978 

*990 

•995 

•997 

*998 

« mm 

. mm 
•swv 

■ 

• sfWBf 

•099* 


*933 

*979 

•991 

•995 

•097 

•998 


•9W 

.mm 

•WIRf 

1*000 

4*8 

•935 

*080 

•991 

•mfo 

.*998 

•998* 

•999 

. mm 

• WXr 

•999* 


4*9 

•936 

*980 

•902 

.OUA 

•998 

•099 

.mm 

•999 

1000 


5»0 

•937 

*981 

*902 

• SWFv 



.mm 

.mms 



5*1 

*938 

*982 

•903 

■ 

•998 

.AACk 

>999 

.mms 
• sRW* 



5*2 

•939* 

*982» 

*093 

•007 

• 2090 

.aon 

•SfW 

.mm 

• SFSW 

1>000 



5*3 

^ *941 

*983 

*993 

•097 

•998 

•999 

•999 




5*4 

•942 

*984 

*004 

•007 

•991^ 

*10209 

. mm* 




5*5 

•943 

*984 

•004 

•097 

•999 

•999 

•099* 




5 -e 

*944 

*985 

*994 

•997* 

•999 

*909 

1*000 





*945 

*98$ 

•995 

•998 

^099 

eOOO 
• 2099 





3-5 

*946 

*986 

•99S 

•WWI 

•000 

aOCiO 

*999 






-947 

*966 

•00$ 

*006 

•000 







0 * 94 t 

0^967 

0*005 

0*006 

O -000 

a-SM* 
APPENDIX TABLE 4 (continued)

Proportion of the area lying to the left of the ordinate of deviation $t$, for values of $t$ proceeding by intervals of 0.1 from 0 to 6, and for values of $\nu$ from 1 to 20 (continued for $\nu$ from 11 to 20).

(Published by permission of Metron and the late W. S. Gosset, who supplied a few corrections to the original tables)


i 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

0 

0*500 

0*500 

0-500 

0-500 

*5000 

0*500 

0-500 

0-500 

0-500 

0-500 

0-1 

•539 

•539 

•539 

•539 

*5.39 

•539 

•539 

•539 

•539 

•539 

•2 

•577 

•578 

•578 

•578 

•578 

•578 

•578 

•578 

•578 

•578 

*3 

•615 

•615 

•615* 

•616 

•616 

•616 

•616 

•616 

•616 

•616 

•4 

•652 

•652 

•652 

•652 

•653 

•653 

•653 

•653 

•653 

•653 

•5 

•686* 

•687 

•687 

•688 

'688 

•688 

•688 

•688 

•689 

*689 

•6 

•720 

•720 

•721 

•721 

•721 

•721* 

•722 

•722 

•722 

•722 

•7 

•751 

•751 

•752 

•752 

•753 

•753 

•753 

•754 

•754 

•754 

•8 

*780 

•780 

•781 

•781* 

•782 

•782 

•783 

•783 

•783 

<783 

•9 

•806 

•807 

•808 

•808 

•809 

•809 

•810 

•810 

•810 

•811 

1*0 

•831 

•831* 

•832 

•833 

•833 

•834 

•834 

•835 

•835 

•835 

M 

•853 

•853* 

•854 

•855 

•856 

•856 

•857 

•857 

•857* 

•858 

1*2 

•872 

*873 

•874 

•875 

•876 

•876 

•877 

•877 

•878 

•878 

1-3 

•890 

•891 

•892 

•893 

•893 

•894 

•894* 

•895 

•895 

•896 

1*4 

•905‘ 

•907 

•907* 

•908 

•509 

•910 

*910 

•911 

•911 

•912 

1*5 

•919 

•920 

•921 

•922 

•923 

•923* 

•924 

•924* 

•925 

•925 

16 

•931 

*932 

•933 

•934 

•935 

•935 

•936 

•936* 

•937 

•937 

1*7 

•941 

•943 

•943* 

•944 

•945 

•946 

•946 

•947 

•947 

•948 

1*8 

•950 

•951* 

•952* 

•953 

•954 

•955 

•955 

•956 

•956 

•956* 

1*9 

•958 

•959 

•960 

•961 

•962 

•962 

•963 

•963 

•964 

•964 

2*0 

•965 

•966 

•967 

•967 

•968 

•969 

•969 

•970 

•970 

•970 

21 

•970 

•971 

•972 

•973 

•973* 

•974 

•974* 

•975 

•975 

•976 

2-2 

•975 

*976 

•977 

•977 

*978 

•979 

•979 

•979 

•980 

•980 

2-3 

•979 

•980 

•981 

•981 

•982 

•982 

•983 

•983 

•983* 

•984 

2-4 

•982 

*983 

•984 

•985 

•985 

•985* 

•986 

•986 

•987 

•987 

2*5 

•985 

•986 

•987 

•987 

•988 

•988 

•988* 

•989 

•989 

•989 

2*6 

•988 

•988 

•989 

•989* 

•990 

-990 

•991 

•991 

•991 

•991 

2*7 

•990 

•990 

•991 

•991 

•992 

•992 

•992 

•993 

•993 

•993 

2*8 

•991 

•992 

•992* 

•fH)3 

-993 

•994 

•994 

•994 

•994 

•994* 

2*9 

*993 

•993 

•994 

•994 

•994* 

•994* 

•995 

•995 

•995 

•996 

3*0 

•994 

•994* 

•995 

•995 

•995* 

•996 

•996 

•996 

•996 

•eswr 

3*1 

*995 

•995 

•996 

•996 

•996 

•997 

•997 

•997 

•997 

•997 

3*2 

•996 

*996 

•996* 

•997 

•997 

•997 

•997 

•997* 

•998 

•998 

3*3 

•996‘ 

•997 

•997 

•997 

•998 

•998 

•998 

•998 

•998 

*998 

3*4 

•997 

*997 

•998 

*998 

•998 

•998 

•998 

•998 

•998* 

•999 

3*5 

•997* 

•998 

•998 

•998 

•998 

•998* 

* ocfy 

•999 

•999 

•999 

3*6 

•998 

•998 

•998 

•999 

•999 

•999 

•999 

•999 



3*7 

•998 

•998* 

•999 

•999 

•999 

•999 

•999 


..OOQ 

♦stw 

•fiRfy 

3*8 

•998* 

•999 

•999 

• 575 #!# 

•999 

* 

•sw# 

.QOQ 

•999 

•999 

3*9 

•999 

•999 

•999 


•999 

•999 

•999 

•999* 


1000 

4‘0 

.000 

•999 

•999 

•999 

•999 

•999* 

.oocw 

•5W5f^ 

1-000 

1000 


4*1 

•999 

•999 

•999 

•999* 

•999* 

1-000 

1-000 




4*2 

•999 

•999 

•999* 

1-000 

1-000 






4*3 

•999 

•999* 

1-000 








4*4 

.QOCI 

1-000 









4*5 

£%r\r\K 










4*6 

1-000 











Note. The significance points of $t$ for values of $\nu$ greater than 20 can be derived by
taking the square root of $F$ (Table 5) for $\nu_1 = 1$, bearing in mind that the $p$ per
cent point of $F$ corresponds to a value of $1 - \tfrac{1}{2}p/100$ in the above table. In the above
table a small terminal 5 means that the original four-figure tables from which these
were compiled ended in a 5.

APPENDIX TABLE 5. Significance points of the variance-ratio F

A. 5 per cent points

Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research,
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers

ν₂ \ ν₁     1       2       3       4       5       6       8       12      24      ∞
   1      161.4   199.5   215.7   224.6   230.2   234.0   238.9   243.9   249.0   254.3
   2      18.51   19.00   19.16   19.25   19.30   19.33   19.37   19.41   19.45   19.50
   3      10.13    9.55    9.28    9.12    9.01    8.94    8.84    8.74    8.64    8.53
   4       7.71    6.94    6.59    6.39    6.26    6.16    6.04    5.91    5.77    5.63
   5       6.61    5.79    5.41    5.19    5.05    4.95    4.82    4.68    4.53    4.36
   6       5.99    5.14    4.76    4.53    4.39    4.28    4.15    4.00    3.84    3.67
   7       5.59    4.74    4.35    4.12    3.97    3.87    3.73    3.57    3.41    3.23
   8       5.32    4.46    4.07    3.84    3.69    3.58    3.44    3.28    3.12    2.93
   9       5.12    4.26    3.86    3.63    3.48    3.37    3.23    3.07    2.90    2.71
  10       4.96    4.10    3.71    3.48    3.33    3.22    3.07    2.91    2.74    2.54
  11       4.84    3.98    3.59    3.36    3.20    3.09    2.95    2.79    2.61    2.40
  12       4.75    3.88    3.49    3.26    3.11    3.00    2.85    2.69    2.50    2.30
  13       4.67    3.80    3.41    3.18    3.02    2.92    2.77    2.60    2.42    2.21
  14       4.60    3.74    3.34    3.11    2.96    2.85    2.70    2.53    2.35    2.13
  15       4.54    3.68    3.29    3.06    2.90    2.79    2.64    2.48    2.29    2.07
  16       4.49    3.63    3.24    3.01    2.85    2.74    2.59    2.42    2.24    2.01
  17       4.45    3.59    3.20    2.96    2.81    2.70    2.55    2.38    2.19    1.96
  18       4.41    3.55    3.16    2.93    2.77    2.66    2.51    2.34    2.15    1.92
  19       4.38    3.52    3.13    2.90    2.74    2.63    2.48    2.31    2.11    1.88
  20       4.35    3.49    3.10    2.87    2.71    2.60    2.45    2.28    2.08    1.84
  21       4.32    3.47    3.07    2.84    2.68    2.57    2.42    2.25    2.05    1.81
  22       4.30    3.44    3.05    2.82    2.66    2.55    2.40    2.23    2.03    1.78
  23       4.28    3.42    3.03    2.80    2.64    2.53    2.38    2.20    2.00    1.76
  24       4.26    3.40    3.01    2.78    2.62    2.51    2.36    2.18    1.98    1.73
  25       4.24    3.38    2.99    2.76    2.60    2.49    2.34    2.16    1.96    1.71
  26       4.22    3.37    2.98    2.74    2.59    2.47    2.32    2.15    1.95    1.69
  27       4.21    3.35    2.96    2.73    2.57    2.46    2.30    2.13    1.93    1.67
  28       4.20    3.34    2.95    2.71    2.56    2.44    2.29    2.12    1.91    1.65
  29       4.18    3.33    2.93    2.70    2.54    2.43    2.28    2.10    1.90    1.64
  30       4.17    3.32    2.92    2.69    2.53    2.42    2.27    2.09    1.89    1.62
  40       4.08    3.23    2.84    2.61    2.45    2.34    2.18    2.00    1.79    1.51
  60       4.00    3.15    2.76    2.52    2.37    2.25    2.10    1.92    1.70    1.39
 120       3.92    3.07    2.68    2.45    2.29    2.17    2.02    1.83    1.61    1.25
   ∞       3.84    2.99    2.60    2.37    2.21    2.09    1.94    1.75    1.52    1.00


Lower 5 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always
correspond with the greater mean square
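
The interchange rule in the footnote can be sketched in a few lines. The helper name `lower_point` and the two-entry lookup are illustrative only and are not part of the tables; the two values are read from the 5 per cent table above, taking ν₁ along the columns and ν₂ down the rows.

```python
# Upper 5 per cent points read from the table above, keyed by (nu1, nu2).
# Only the two entries needed for the illustration are copied here.
UPPER_5 = {(8, 4): 6.04, (4, 8): 3.84}


def lower_point(nu1, nu2, upper=UPPER_5):
    """Lower point for (nu1, nu2): reciprocal of the upper point for (nu2, nu1)."""
    return 1.0 / upper[(nu2, nu1)]


# Lower 5 per cent point for nu1 = 4, nu2 = 8 uses the tabled value for nu1 = 8, nu2 = 4.
lower_4_8 = lower_point(4, 8)
print(f"{lower_4_8:.4f}")
```

In practice the table is entered with the greater mean square in the numerator, so the lower points are seldom needed directly.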
APPENDIX TABLE 5 (continued). Significance points of the variance-ratio F


B. 1 per cent points

Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research,
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers


ν₂\ν₁       1       2       3       4       5       6       8      12      24       ∞

    1    4052    4999    5403    5625    5764    5859    5981    6106    6234    6366
    2   98.49   99.00   99.17   99.25   99.30   99.33   99.36   99.42   99.46   99.50
    3   34.12   30.81   29.46   28.71   28.24   27.91   27.49   27.05   26.60   26.12
    4   21.20   18.00   16.69   15.98   15.52   15.21   14.80   14.37   13.93   13.46
    5   16.26   13.27   12.06   11.39   10.97   10.67   10.27    9.89    9.47    9.02
    6   13.74   10.92    9.78    9.15    8.75    8.47    8.10    7.72    7.31    6.88
    7   12.25    9.55    8.45    7.85    7.46    7.19    6.84    6.47    6.07    5.65
    8   11.26    8.65    7.59    7.01    6.63    6.37    6.03    5.67    5.28    4.86
    9   10.56    8.02    6.99    6.42    6.06    5.80    5.47    5.11    4.73    4.31
   10   10.04    7.56    6.55    5.99    5.64    5.39    5.06    4.71    4.33    3.91
   11    9.65    7.20    6.22    5.67    5.32    5.07    4.74    4.40    4.02    3.60
   12    9.33    6.93    5.95    5.41    5.06    4.82    4.50    4.16    3.78    3.36
   13    9.07    6.70    5.74    5.20    4.86    4.62    4.30    3.96    3.59    3.16
   14    8.86    6.51    5.56    5.03    4.69    4.46    4.14    3.80    3.43    3.00
   15    8.68    6.36    5.42    4.89    4.56    4.32    4.00    3.67    3.29    2.87
   16    8.53    6.23    5.29    4.77    4.44    4.20    3.89    3.55    3.18    2.75
   17    8.40    6.11    5.18    4.67    4.34    4.10    3.79    3.45    3.08    2.65
   18    8.28    6.01    5.09    4.58    4.25    4.01    3.71    3.37    3.00    2.57
   19    8.18    5.93    5.01    4.50    4.17    3.94    3.63    3.30    2.92    2.49
   20    8.10    5.85    4.94    4.43    4.10    3.87    3.56    3.23    2.86    2.42
   21    8.02    5.78    4.87    4.37    4.04    3.81    3.51    3.17    2.80    2.36
   22    7.94    5.72    4.82    4.31    3.99    3.76    3.45    3.12    2.75    2.31
   23    7.88    5.66    4.76    4.26    3.94    3.71    3.41    3.07    2.70    2.26
   24    7.82    5.61    4.72    4.22    3.90    3.67    3.36    3.03    2.66    2.21
   25    7.77    5.57    4.68    4.18    3.86    3.63    3.32    2.99    2.62    2.17
   26    7.72    5.53    4.64    4.14    3.82    3.59    3.29    2.96    2.58    2.13
   27    7.68    5.49    4.60    4.11    3.78    3.56    3.26    2.93    2.55    2.10
   28    7.64    5.45    4.57    4.07    3.75    3.53    3.23    2.90    2.52    2.06
   29    7.60    5.42    4.54    4.04    3.73    3.50    3.20    2.87    2.49    2.03
   30    7.56    5.39    4.51    4.02    3.70    3.47    3.17    2.84    2.47    2.01
   40    7.31    5.18    4.31    3.83    3.51    3.29    2.99    2.66    2.29    1.80
   60    7.08    4.98    4.13    3.65    3.34    3.12    2.82    2.50    2.12    1.60
  120    6.85    4.79    3.95    3.48    3.17    2.96    2.66    2.34    1.95    1.38
    ∞    6.64    4.60    3.78    3.32    3.02    2.80    2.51    2.18    1.79    1.00


Lower 1 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always
correspond with the greater mean square

APPENDIX TABLE 5 (continued). Significance points of the variance-ratio F
C. 0.1 per cent points

Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research,
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers



Lower 0.1 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always
correspond with the greater mean square

APPENDIX TABLE 6. Significance points of the distribution of z

A. 5 per cent points

Reproduced by kind permission of Professor R. A. Fisher and Messrs. Oliver and Boyd from the former's
Statistical Methods for Research Workers


ν₂\ν₁       1       2       3       4       5       6       8      12      24       ∞

    1  2.5421  2.6479  2.6870  2.7071  2.7194  2.7276  2.7380  2.7484  2.7588  2.7693
    2  1.4592  1.4722  1.4765  1.4787  1.4800  1.4808  1.4819  1.4830  1.4840  1.4851
    3  1.1577  1.1284  1.1137  1.1051  1.0994  1.0953  1.0899  1.0842  1.0781  1.0716
    4  1.0212  0.9690  0.9429  0.9272  0.9168  0.9093  0.8993  0.8885  0.8767  0.8639
    5  0.9441  0.8777  0.8441  0.8236  0.8097  0.7997  0.7862  0.7714  0.7550  0.7368
    6  0.8948  0.8188  0.7798  0.7558  0.7394  0.7274  0.7112  0.6931  0.6729  0.6499
    7  0.8606  0.7777  0.7347  0.7080  0.6896  0.6761  0.6576  0.6369  0.6134  0.5862
    8  0.8355  0.7475  0.7014  0.6725  0.6525  0.6378  0.6175  0.5945  0.5682  0.5371
    9  0.8163  0.7242  0.6757  0.6450  0.6238  0.6080  0.5862  0.5613  0.5324  0.4979
   10  0.8012  0.7058  0.6553  0.6232  0.6009  0.5843  0.5611  0.5346  0.5035  0.4657
   11  0.7889  0.6909  0.6387  0.6055  0.5822  0.5648  0.5406  0.5126  0.4795  0.4387
   12  0.7788  0.6786  0.6250  0.5907  0.5666  0.5487  0.5234  0.4941  0.4592  0.4156
   13  0.7703  0.6682  0.6134  0.5783  0.5535  0.5350  0.5089  0.4785  0.4419  0.3957
   14  0.7630  0.6594  0.6036  0.5677  0.5423  0.5233  0.4964  0.4649  0.4269  0.3782
   15  0.7568  0.6518  0.5950  0.5585  0.5326  0.5131  0.4855  0.4532  0.4138  0.3628
   16  0.7514  0.6451  0.5876  0.5505  0.5241  0.5042  0.4760  0.4428  0.4022  0.3490
   17  0.7466  0.6393  0.5811  0.5434  0.5166  0.4964  0.4676  0.4337  0.3919  0.3366
   18  0.7424  0.6341  0.5753  0.5371  0.5099  0.4894  0.4602  0.4255  0.3827  0.3253
   19  0.7386  0.6295  0.5701  0.5315  0.5040  0.4832  0.4535  0.4182  0.3743  0.3151
   20  0.7352  0.6254  0.5654  0.5265  0.4986  0.4776  0.4474  0.4116  0.3668  0.3057
   21  0.7322  0.6216  0.5612  0.5219  0.4938  0.4725  0.4420  0.4055  0.3599  0.2971
   22  0.7294  0.6182  0.5574  0.5178  0.4894  0.4679  0.4370  0.4001  0.3536  0.2892
   23  0.7269  0.6151  0.5540  0.5140  0.4854  0.4636  0.4325  0.3950  0.3478  0.2818
   24  0.7246  0.6123  0.5508  0.5106  0.4817  0.4598  0.4283  0.3904  0.3425  0.2749
   25  0.7225  0.6097  0.5478  0.5074  0.4783  0.4562  0.4244  0.3862  0.3376  0.2685
   26  0.7205  0.6073  0.5451  0.5045  0.4752  0.4529  0.4209  0.3823  0.3330  0.2625
   27  0.7187  0.6051  0.5427  0.5017  0.4723  0.4499  0.4176  0.3786  0.3287  0.2569
   28  0.7171  0.6030  0.5403  0.4992  0.4696  0.4471  0.4146  0.3752  0.3248  0.2516
   29  0.7155  0.6011  0.5382  0.4969  0.4671  0.4444  0.4117  0.3720  0.3211  0.2466
   30  0.7141  0.5994  0.5362  0.4947  0.4648  0.4420  0.4090  0.3691  0.3176  0.2419
   60  0.6933  0.5738  0.5073  0.4632  0.4311  0.4064  0.3702  0.3255  0.2654  0.1644
    ∞  0.6729  0.5486  0.4787  0.4319  0.3974  0.3706  0.3309  0.2804  0.2085  0.0000
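
Fisher's z is half the natural logarithm of the variance ratio, z = ½ logₑ F, so the entries of this table can be checked against the 5 per cent points of Appendix Table 5. The sketch below does this for three cells; the function name `z_from_f` is illustrative, and small discrepancies are expected because the F values are tabled to fewer significant figures.

```python
import math


def z_from_f(f):
    """Fisher's z corresponding to a variance ratio F: z = 0.5 * ln(F)."""
    return 0.5 * math.log(f)


# (F from the 5 per cent table, z from this table) for nu1 = 1 and nu2 = 1, 2, 10.
pairs = [(161.4, 2.5421), (18.51, 1.4592), (4.96, 0.8012)]
differences = [abs(z_from_f(f) - z) for f, z in pairs]

for (f, z), d in zip(pairs, differences):
    print(f, z, round(z_from_f(f), 4), round(d, 4))
```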
APPENDIX TABLE 6 (contd.). Significance points of the distribution of z

B. 1 per cent points

Reproduced by kind permission of Professor R. A. Fisher and Messrs. Oliver and Boyd from the former's
Statistical Methods for Research Workers


ν₂\ν₁       1       2       3       4       5       6       8      12      24       ∞

    1  4.1535  4.2585  4.2974  4.3175  4.3297  4.3379  4.3482  4.3585  4.3689  4.3794
    2  2.2950  2.2976  2.2984  2.2988  2.2991  2.2992  2.2994  2.2997  2.2999  2.3001
    3  1.7649  1.7140  1.6915  1.6786  1.6703  1.6645  1.6569  1.6489  1.6404  1.6314
    4  1.5270  1.4452  1.4075  1.3856  1.3711  1.3609  1.3473  1.3327  1.3170  1.3000
    5  1.3943  1.2929  1.2449  1.2164  1.1974  1.1838  1.1656  1.1457  1.1239  1.0991
    6  1.3103  1.1955  1.1401  1.1068  1.0843  1.0680  1.0460  1.0218  0.9948  0.9643
    7  1.2526  1.1281  1.0672  1.0300  1.0048  0.9864  0.9614  0.9335  0.9020  0.8658
    8  1.2106  1.0787  1.0135  0.9734  0.9459  0.9259  0.8983  0.8673  0.8319  0.7904
    9  1.1786  1.0411  0.9724  0.9299  0.9006  0.8791  0.8494  0.8157  0.7769  0.7305
   10  1.1535  1.0114  0.9399  0.8954  0.8646  0.8419  0.8104  0.7744  0.7324  0.6816
   11  1.1333  0.9874  0.9136  0.8674  0.8354  0.8116  0.7785  0.7405  0.6958  0.6408
   12  1.1166  0.9677  0.8919  0.8443  0.8111  0.7864  0.7520  0.7122  0.6649  0.6061
   13  1.1027  0.9511  0.8737  0.8248  0.7907  0.7652  0.7295  0.6882  0.6386  0.5761
   14  1.0909  0.9370  0.8581  0.8082  0.7732  0.7471  0.7103  0.6675  0.6159  0.5500
   15  1.0807  0.9249  0.8448  0.7939  0.7582  0.7314  0.6937  0.6496  0.5961  0.5269
   16  1.0719  0.9144  0.8331  0.7814  0.7450  0.7177  0.6791  0.6339  0.5786  0.5064
   17  1.0641  0.9051  0.8229  0.7705  0.7335  0.7057          0.6199  0.5630  0.4879
   18  1.0572  0.8970  0.8138  0.7607  0.7232  0.6950  0.6549  0.6075  0.5491  0.4712
   19  1.0511  0.8897  0.8057  0.7521  0.7140  0.6854  0.6447  0.5964  0.5366  0.4560
   20  1.0457  0.8831  0.7985  0.7443  0.7058  0.6768  0.6355  0.5864  0.5253  0.4421
   21  1.0408  0.8772  0.7920  0.7372  0.6984  0.6690  0.6272  0.5773  0.5150  0.4294
   22  1.0363  0.8719  0.7860  0.7309  0.6916  0.6620  0.6196  0.5691  0.5056  0.4176
   23  1.0322  0.8670  0.7806  0.7251  0.6855  0.6555  0.6127  0.5615  0.4969  0.4068
   24  1.0285  0.8626  0.7757  0.7197  0.6799  0.6496  0.6064  0.5545  0.4890  0.3967
   25  1.0251  0.8585  0.7712  0.7148  0.6747  0.6442  0.6006  0.5481  0.4816  0.3872
   26  1.0220  0.8548  0.7670  0.7103  0.6699  0.6392  0.5952  0.5422  0.4748  0.3784
   27  1.0191  0.8513  0.7631  0.7062  0.6655  0.6346  0.5902  0.5367  0.4685  0.3701
   28  1.0164  0.8481  0.7595  0.7023  0.6614  0.6303  0.5856  0.5316  0.4626  0.3624
   29  1.0139  0.8451  0.7562  0.6987  0.6576  0.6263  0.5813  0.5269  0.4570  0.3550
   30  1.0116  0.8423  0.7531  0.6954  0.6540  0.6226  0.5773  0.5224  0.4519  0.3481
   60  0.9784  0.8025  0.7086  0.6472  0.6028  0.5687  0.5189  0.4574  0.3746  0.2352
    ∞  0.9462  0.7636  0.6651  0.5999  0.5522  0.5152  0.4604  0.3908  0.2913  0.0000
APPENDIX TABLE 6 (contd.). Significance points of the distribution of z

C. 0.1 per cent points

Reproduced by kind permission of Professor R. A. Fisher, Dr. W. E. Deming and Messrs. Oliver and Boyd
from Prof. Fisher's Statistical Methods for Research Workers


ν₂\ν₁       1       2       3       4       5       6       8      12      24       ∞

    1  6.4562  6.5612  6.5966  6.6201  6.6323  6.6405  6.6508  6.6611  6.6715  6.6819
    2  3.4531  3.4534  3.4535  3.4535  3.4535  3.4535  3.4536  3.4537  3.4536  3.4536
    3  2.5604  2.5003  2.4748  2.4603  2.4511  2.4446  2.4361  2.4272  2.4179  2.4081
    4  2.1529  2.0574  2.0143  1.9892  1.9728  1.9612  1.9459  1.9294  1.9118  1.8927
    5  1.9255  1.8002  1.7513  1.7184  1.6964  1.6808  1.6596  1.6370  1.6123  1.5845
    6  1.7849  1.6479  1.5828  1.5433  1.5177  1.4986  1.4730  1.4449  1.4134  1.3783
    7  1.6874  1.5384  1.4662  1.4221  1.3927  1.3711  1.3417  1.3090  1.2721  1.2296
    8  1.6177  1.4587  1.3809  1.3332  1.3008  1.2770  1.2443  1.2077  1.1662  1.1169
    9  1.5646  1.3982  1.3160  1.2653  1.2304  1.2047  1.1694  1.1293  1.0830  1.0279
   10  1.5232  1.3509  1.2650  1.2116  1.1748  1.1475  1.1098  1.0668  1.0165  0.9557
   11  1.4900  1.3128  1.2238  1.1683  1.1297  1.1012  1.0614  1.0157  0.9619  0.8957
   12  1.4627  1.2814  1.1900  1.1326  1.0926  1.0628  1.0213  0.9733  0.9162  0.8450
   13  1.4400  1.2553  1.1616  1.1026  1.0614  1.0306  0.9875  0.9374  0.8774  0.8014
   14  1.4208  1.2332  1.1376  1.0772  1.0348  1.0031  0.9586  0.9066  0.8439  0.7635
   15  1.4043  1.2141  1.1169  1.0553  1.0119  0.9795  0.9336  0.8800  0.8147  0.7301
   16  1.3900  1.1976  1.0989  1.0362  0.9920  0.9588  0.9119  0.8567  0.7891  0.7005
   17  1.3775  1.1832  1.0832  1.0195  0.9745  0.9407  0.8927  0.8361  0.7664  0.6740
   18  1.3665  1.1704  1.0693  1.0047  0.9590  0.9246  0.8757  0.8178  0.7462  0.6502
   19  1.3567  1.1591  1.0569  0.9915  0.9442  0.9103  0.8605  0.8014  0.7277  0.6285
   20  1.3480  1.1489  1.0458  0.9798  0.9329  0.8974  0.8469  0.7867  0.7115  0.6086
   21  1.3401  1.1398  1.0358  0.9691  0.9217  0.8858  0.8346  0.7735  0.6964  0.5904
   22  1.3329  1.1315  1.0268  0.9595  0.9116  0.8753  0.8234  0.7612  0.6828  0.5738
   23  1.3264  1.1240  1.0186  0.9507  0.9024  0.8657  0.8132  0.7501  0.6704  0.5583
   24  1.3205  1.1171  1.0111  0.9427  0.8939  0.8569  0.8038  0.7400  0.6589  0.5440
   25  1.3151  1.1108  1.0041  0.9354  0.8862  0.8489  0.7953  0.7306  0.6483  0.5307
   26  1.3101  1.1050  0.9978  0.9286  0.8791  0.8415  0.7873  0.7220  0.6385  0.5183
   27  1.3055  1.0997  0.9920  0.9223  0.8725  0.8346  0.7800  0.7140  0.6294  0.5066
   28  1.3013  1.0947  0.9866  0.9165  0.8664  0.8282  0.7732  0.7066  0.6209  0.4957
   29  1.2973  1.0903  0.9815  0.9112  0.8607  0.8223  0.7679  0.6997  0.6129  0.4853
   30  1.2936  1.0859  0.9768  0.9061  0.8554  0.8168  0.7610  0.6932  0.6056  0.4756
   40  1.2674  1.0552  0.9435  0.8701  0.8174  0.7771  0.7184  0.6463  0.5513  0.4016
   60  1.2413  1.0248  0.9100  0.8345  0.7798  0.7377  0.6760  0.5992  0.4955  0.3198
    ∞  1.1910  0.9663  0.8453  0.7648  0.7059  0.6599  0.5917  0.5044  0.3786  0.0000







ANSWERS TO THE EXERCISES 

AND HINTS ON THEIR SOLUTION 


CHAPTER 1 


N          26,287     (AB)          887
(A)         2,308     (AC)          374
(B)         2,853     (BC)          353
(C)           749     (ABC)         149

(ABC)         156     (αBC)         179
(ABγ)         431     (αBγ)       1,249
(AβC)         272     (αβC)         163
(Aβγ)         759     (αβγ)      20,504


1.3. The frequencies not given in the question itself are:

(a) (AB) 107, (AC) 405, (BC) 525.

(b) (Aβγ) 22,980, (αBγ) 13,585, (αβC) 96,478, (αβγ) 28,868,495.


1.4. 

(^B) (8) 

(AB) ^ (B) 

(Afi) ^ (fi) 

• (AB) + (Afi)^ (B)+(fi) 

that is 

(AB) (A) 
(B) AT' 

that ia 

(Bj-(AB)^ N-(A) 

that is 

(AB) (A) 
(oTB) ^ (ij 



1.7. 160. Take A = husband exceeding wife in first measurement, B = husband
exceeding wife in second measurement, and find (αβ).

1.8. 38. If A, B, C denote passing first, second and third examinations, (C),
(αβC) and (ABγ) are all that is necessary to answer the question. The other five
frequencies (including N) are redundant.

Further, N − (αβC) − (αβγ) = (A) + (B) − (ABC) − (ABγ), i.e. there is a linear
relation between the given frequencies and the ultimate frequencies are therefore
indeterminate.

1.9, 10 per cent. 

1.11. Denoting government, voting for the motion and English membership by 
A, B, C, we have (^BC)«300, {aBC)^$3, {AfiC) ^10, (a^C)« 102, {.4By)«3ft 
faBy)«72, (^^y)«8, {afiy)^2S. » \ n ^ 

1.13. 80/263 or 304 per thousand. 

1.14. 55/85 or 65 per cent, 

1.15. 32 per cent and 30 per cent 

1.16. 117. 

1.17. 108. 

1.20. (1 -2g), p^\ (1 +2j), i.e, p must lie between 0 and J (1 -2 j) or between 
HI +2$) and*. 

1.21. As a hint, remember the condition that

(BC) ≥ (B) + (C) − N

1.22. If A, B, C denote liking chocolates, toffee or boiled sweets, in negative.


CHAPTER 2 


2.1. Deaf-mutes from childhood per million among males 222; among females
183; there is therefore positive association between deaf-mutism and male sex: if
there had been no association between deaf-mutism and sex, there would have been
3,176 male and 3,393 female deaf-mutes.

2.2. (a) Positive association, since (AB)₀ = 1.457.

(b) Negative association, since 294/490 = 3/5, 380/570 = 2/3.

(c) Independence, since 256/768 = 1/3, 48/144 = 1/3.

2.3. Percentage of plants above the average height, by parentage

Species               Crossed    Self-fertilised
Ipomaea purpurea         86            25
Petunia violacea         79            17
Reseda lutea             78            34
Reseda odorata           71            45
Lobelia fulgens          50            35

The association is much less for the species at the end than for those at the beginning
of the list.

2.4. Percentage of dark-eyed amongst the sons of dark-eyed fathers 39 per cent.

Percentage of dark-eyed amongst the sons of not dark-eyed fathers 10 per cent.

If there had been no heredity, the frequencies to the nearest unit would have been

(AB)₀, 18; (Aβ)₀, 111; (αB)₀, 121; (αβ)₀, 750.

2.5. Percentage of light-eyed amongst the wives of light-eyed husbands 59 per cent.

Percentage of light-eyed amongst the wives of not light-eyed husbands 53 per cent.

If there had been no association: (AB)₀ = 298, (Aβ)₀ = 225, (αB)₀ = 143, (αβ)₀ = 108.
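
The "no association" frequencies are the products of the marginal totals divided by N. As a rough check, the sketch below recovers the margins by adding the four quoted values (298, 225, 143, 108) and recomputes each cell; treating the rounded values as giving the exact margins is an assumption of this illustration, and the function name is hypothetical.

```python
def independence_frequencies(n_a, n_b, n):
    """Expected (AB), (A beta), (alpha B), (alpha beta) when A and B are independent."""
    n_alpha, n_beta = n - n_a, n - n_b
    return [n_a * n_b / n, n_a * n_beta / n, n_alpha * n_b / n, n_alpha * n_beta / n]


# Margins recovered from the quoted answer (assumed, not taken from the question).
quoted = [298, 225, 143, 108]
n = sum(quoted)                 # N = 774
n_a = quoted[0] + quoted[1]     # (A) = 523
n_b = quoted[0] + quoted[2]     # (B) = 441

expected = independence_frequencies(n_a, n_b, n)
print([round(v) for v in expected])
```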

2.6. The following are the proportions of the insane per thousand in successive
age-groups:

In general population: 0.9, 2.3, 4.1, 5.7, 6.9, 7.5, 7.7, 6.8
Amongst the blind: 20.1, 16.0, 16.3, 20.7, 18.3, 17.8, 11.4, 5.3

Note the diminishing association, which is especially clear in the age-group 65-,
and the negative association in the last age-group. The association coefficient gives
the values below, which decrease continuously:

Association coefficient: +0.92, +0.75, +0.61, +0.57, +0.46, +0.41, +0.20, −0.13.

2.10. +0.90.

2.11. +0.70.
amongst those exhibiting nerve signs, as compared with those who do not exhibit
nerve signs, or with the girls in general. As the association amongst those who do
not exhibit nerve signs is quite as high as for the girls in general, the conclusion
quoted does not seem valid.


2.15.
                  (1)         (2)                          (1)         (2)
                  Per         Per                          Per         Per
               thousand    thousand                     thousand    thousand
(B)/N             3.2         7.5      (A)/N               0.9         4.0
(AB)/(A)         14.9        11.7      (AB)/(B)            4.0         6.3
(BC)/(C)         38.8        63.0      (AC)/(C)            6.6        18.8
(ABC)/(AC)        216         214      (ABC)/(BC)         36.8        63.8


The above give the two simplest comparisons, either of which is sufficient to show
that there is a high association between blindness and mental derangement amongst
the deaf-mutes as well as association in the general population; amongst the old,
the association is, in fact, small for the general population, but well-marked for
deaf-mutes. This result stands in direct contrast with that of Exercise 2.14, where the
association between the two defects A and D was much smaller in the defective
population p than in the population at large. As previously stated, no great reliance can be
placed on the census data as to these infirmities.

2.16. If the cancer death-rates for farmers over 45 and under 45 respectively were
the same as for the population at large, the rate for all farmers over 16 would be
2.726. This is slightly greater than the actual value 2.633 but the difference would
not justify any statement that "farmers were peculiarly liable to cancer," or not.

2.17. 15 per cent.

2.19. If A and B were independent in both C and y populations, we should have 
{AB) equal to x 419 , 151 X 139 , 

617 ■*' 383 ^ 

Actually (AB) is only 358. Therefore A and B must be disassociated in one partial 
population or both. 

2.22. (1) 68.1 per cent. (2) 42.5 per cent. The possible fallacy that a total
association between "spending more than one's opponent" and "winning" only
meant that Conservatives spent more and that Conservative principles carried the
day is now avoided, and there seems no reason for declining to consider this as evidence
of the effect of expenditure on election results.

2.23. The limits to y are j-<l(3;r-*»-l) 

>i(*+**) 

subject to the conditions y ≤ x, y ≥ 0, y ≥ 2x − 1. No inference of a positive association
from two negatives is possible unless x lies between the limits 0.382... and 0.618...

2.24. The limits to y are 

(1) y<i(ex^ex^-^l) 

>i(xA-6x*) 

subject to conditions y^O, ^4x— 1, 

An inference is only possible from positive associations oi A B and if x | ; 
an inference is only possible from tw*o negative associations if x lie between 0*211 . . . 
and 0*274 . . . Note that x cannot exceed 

(2) y<J(6;r~3x»-l) 

>t(2x+3x*) 

subject to conditions ^'^O, ^5Ar— 1, 

No inference is possible from positive associations of A B and BC. 

An inference is only possible from negative associations if x lie between 0*183. . . 
and 0*215 .. . Note that x cannot exceed 

(3) y<|(6Ar-2Ar*-l) 

>l(3;r-i-2x*) 

subject to the conditions y^O, ^5;r--l, ^at. 

As in (2), no inference is possible from positive associations of AC and BC; an
inference is possible from negative associations if x lie between 0.177... and
0.224... Note that x cannot exceed


CHAPTER 3 


3.1. A, 0.68; B, 0.36.

3.2. C = 0.02, T = 0.01.

3.4. Th« table is not isotropic as it stands. It becomes positively so if the columns 
are arranged in the order Ai, A^, Ag, Ag, Ag^ and the rows in order {itom top to bottom) 

B,. B,. 

3.5. C = 0.05, T = 0.03.

3.7. C = 0.40. For a large number such as 1,000 this is probably significant, i.e.
not due to fluctuations of sampling. From inspection of the tables the contingency
is positive, i.e. this evidence would suggest that persons tended on the whole to prefer
music of their own nationality. But there are exceptions, e.g. the English.

In any case these data are purely imaginary, and it is not suggested that they reflect
in any way the true state of affairs.

3.8. C = 0.23, T = 0.17, suggestive of slight association.

3.10. C = 0.10.


CHAPTER 4 


4.1. 1200, 200. 

4.2. 270, 40. 

4.3. 92.375.

4.4. 216.5.

4.5. (a) J-shaped; (b) U-shaped; (c) single-humped moderately asymmetrical;

(d) J-shaped in all three cases.


CHAPTER 5 

5.2. 14.58.

5.3. Mean, 156.73 lb. Median, 154.67 lb. Mode (approx.), 150.6 lb. (Note that
the mean and the median should be taken to a place of decimals further than is desired
for the mode; the true mode, found by fitting a theoretical frequency curve, is 151.1 lb.)

5.4. Mean, 0.6330. Median, 0.6391. Mode (approx.), 0.651. (True mode is
0.653.)
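
The approximate modes quoted in 5.3 and 5.4 are consistent with the familiar empirical relation mode ≈ mean − 3(mean − median). The sketch below applies that relation to the two answers; the function name is illustrative, and the relation is only a rough guide for moderately asymmetrical distributions.

```python
def approximate_mode(mean, median):
    """Empirical relation for moderately skew distributions: mean - 3 * (mean - median)."""
    return mean - 3.0 * (mean - median)


mode_5_3 = approximate_mode(156.73, 154.67)   # weights in lb.
mode_5_4 = approximate_mode(0.6330, 0.6391)

print(f"{mode_5_3:.2f} {mode_5_4:.4f}")
```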

5.5. About £3,250.

5.6. Maaa-?i?. 

5.7. (1) 82.75, (2) 81.78, (3) 80.25, (4) 80.25.

5.8. Aritbmetic ineaa.-j^^(2"-n — 1). 

n 

Gemnetrlc mean »*2^. 

Harmonk mean — ? 

5.9. Mean = np. If the terms of the given binomial series are multiplied by 0, 1,
2, ..., note that the resulting series is also a binomial when a common factor is removed.
(A full proof is given in Chapter 10.)

5.11. (1) 921,507, (2) 916,963.

5.12. For K.M. specials, 15s. 1d. per 120; for ordinaries, 12s. 9d. per 120.


CHAPTER 6

6.2. Standard deviation 21.3 lb. Mean deviation 16.4 lb. Lower quartile 142.5,
upper quartile 168.4; whence Q = 12.95. Ratios: m.d./s.d. = 0.77, Q/s.d. = 0.61.

6.3. Median = £3,250, upper quartile = £5,000, 9th decile 600 approximately.

6.4. Q₁ = 24.13 years. Median = 27.29 years. Q₃ = 32.19 years. Q = 4.03 years.

6.5. 2.872.

6.6. This proposition is equivalent to the one that the square of the mean of a set 
of positive numbers is less than the mean of the squares. This is proved in most 
textbooks on Algebra. 

6.8. (1) M = 73.2, σ = 17.3; (2) M = 73.2, σ = 17.5; (3) M = 73.2, σ = 18.0.
(Note that while the mean is unaffected in the first place of decimals, the standard
deviation is higher the coarser the grouping.)

6.9. England, σ = 2.55; Scotland, σ = 2.48; Wales, σ = 2.33; Ireland, σ = 2.15
inches. For the weight distribution σ = 21.14 lb.

6.10. √(npq). The proof is given in Chapter 8.

6.11. The assumption that observations are evenly distributed over the intervals
does not affect the sum of deviations, except for the interval in which the mean or
median lies; for that interval the sum is n₂(0.25 + d²), hence the entire correction is

d(n₁ − n₃) + n₂(0.25 + d²)

In this expression d is, of course, expressed as a fraction of the class-interval, and is
given its proper sign.

6.14. 3.80, 3.65, 3.53, 3.20.


CHAPTER 7 


7.1. In class-intervals of 10 lb.

μ₂ = 4.470, μ₃ = 6.927, μ₄ = 89.119; β₁ = 0.537, β₂ = 4.461.

Curve leptokurtic.
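
The β-coefficients follow from the moments by β₁ = μ₃²/μ₂³ and β₂ = μ₄/μ₂². A short sketch recomputing them from the three moments quoted for 7.1 is given below; the function name is illustrative, and agreement is only to about the third decimal place because the moments themselves are rounded.

```python
def beta_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2) from the second, third and fourth moments about the mean."""
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2


beta1, beta2 = beta_coefficients(4.470, 6.927, 89.119)
print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
# beta2 exceeds 3, the value for the normal curve, hence "leptokurtic".
```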

7.2. 0.06, 0.29, 0.27.

7.3. μ₂ = 11.375, μ₃ = 12.705, μ₄ = 428.708, in class-intervals of 1 gallon.
β₁ = 0.110, β₂ = 3.313.

Measures of skewness are 0.027, 0.14, 0.15. The second is obtained by
approximating to the mode in the manner of 5.26.

7.4. Before corrections, μ₂ = 7.301, μ₃ = 0.166, μ₄ = 163.465;

After corrections, μ₂ = 6.551, μ₃ = 0.166, μ₄ = 132.975.

Note that the small negative μ₃ in the finer grouping becomes positive in the coarser
grouping.
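
The corrected values in 7.4 agree with Sheppard's corrections, μ₂ − h²/12 and μ₄ − ½h²μ₂ + (7/240)h⁴, with μ₃ left unchanged. The class-interval h = 3 used below is not stated in the answer; it is inferred from the difference 7.301 − 6.551 = 0.75 = h²/12 and should be read as an assumption of this sketch.

```python
def sheppard_corrections(mu2, mu3, mu4, h):
    """Apply Sheppard's corrections for grouping with class-interval h."""
    mu2_c = mu2 - h ** 2 / 12.0
    mu4_c = mu4 - 0.5 * h ** 2 * mu2 + 7.0 * h ** 4 / 240.0
    return mu2_c, mu3, mu4_c   # the third moment needs no correction


# h = 3 is inferred from the quoted figures, not given in the answer.
corrected = sheppard_corrections(7.301, 0.166, 163.465, h=3.0)
print(tuple(round(v, 3) for v in corrected))
```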

7.5. μ₃ = npq(q − p).

μ₄ = 3n²p²q² + npq(1 − 6pq).

7.6. About the mean, μ₂ = 14.75, μ₃ = 39.75, μ₄ = 142.3125.

About the origin, μ₂′ = 21, μ₃′ = 166, μ₄′ = 1132.
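
The two sets of moments in 7.6 are linked by the usual transfer formulae from moments about the origin to moments about the mean. The sketch below assumes a mean of 2.5, which is not quoted in the answer but is inferred from μ₂′ − μ₂ = 21 − 14.75 = 6.25 (taking the positive root).

```python
import math


def central_from_raw(m1, m2, m3, m4):
    """Moments about the mean from moments m1..m4 about the origin."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu2, mu3, mu4


# Mean inferred from the quoted second moments (an assumption of this sketch).
mean = math.sqrt(21 - 14.75)
central = central_from_raw(mean, 21, 166, 1132)
print(mean, central)
```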

7.8. This proposition is equivalent to that of Exercise 6.6. For U-shaped
populations β₂ < 2.

7.9. κ₂ = 7.657, κ₃ = 36.152, κ₄ = 259.335.


CHAPTER 8


8.1. 27.31 per cent.

8.2. Expected frequencies are: 1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1.
Expected mean = 6; expected σ = 1.732.

Actual mean = 6.139; actual σ = 1.712.

8.3. $y = \dfrac{4096}{1.712\sqrt{2\pi}}\, e^{-(x-6.139)^{2}/2(1.712)^{2}}$

Expected frequencies, to nearest units, are: 2, 11, 51, 178, 438, 765, 951, 841, 529,
236, 75, 17, 3, totalling 4097; (these are obtained by simple interpolation in Appendix
Table 1).

8.4. 17. 


8.5. If p is the expectation of getting an even number, 

Hence, ^■*1. and the number of times is 10,000(1)^®= once. 

8.8. The frequency of r successes is greater than that of r − 1 so long as r < np + p;
if np is an integer, r = np gives the greatest term and also the mean.

8.9. This follows at once from a consideration of the Galton-Pearson apparatus.


Binomial     Normal curve
    1             1.7
   10            10.5
   45            42.7
  120           116.1
  210           211.5
  252           258.4
  210           211.5
  etc.           etc.
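
The two columns above can be reproduced on the assumption that the binomial is 1024(½ + ½)¹⁰, which the binomial column (1, 10, 45, ...) suggests but the surviving text does not state. The normal ordinates are then N/(σ√2π)·exp{−(x − np)²/2σ²} with N = 1024 and σ² = npq = 2.5.

```python
import math

# Assumed set-up: n = 10 trials, p = 1/2, total frequency 2**10 = 1024.
n, p = 10, 0.5
total = 2 ** n
sigma = math.sqrt(n * p * (1 - p))

binomial = [math.comb(n, k) for k in range(0, 6)]
normal = [
    total / (sigma * math.sqrt(2 * math.pi))
    * math.exp(-((k - n * p) ** 2) / (2 * sigma ** 2))
    for k in range(0, 6)
]

for b, v in zip(binomial, normal):
    print(b, round(v, 1))
```

The remaining terms follow by symmetry about the central term.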


8.11. Mean 74.3, standard deviation 3.23.


8.12. About zero mean the deciles are: 0, 0.2533, 0.5244, 0.8416, 1.2816, and
the corresponding negative values.


8.13. $y = \dfrac{8585}{2.57\sqrt{2\pi}}\, e^{-(x-67.46)^{2}/2(2.57)^{2}}$


Calculated mean and quartile deviations, 2.05 and 1.73 (observed, 2.02 and 1.75).
These figures are in units of one inch.
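
For a normal curve the mean deviation is σ√(2/π) and the quartile deviation is about 0.6745σ. The sketch below recomputes the "calculated" figures of 8.13 from σ = 2.57 inches, using the standard library's `statistics.NormalDist` for the upper quartile of the unit normal.

```python
import math
from statistics import NormalDist

sigma = 2.57  # standard deviation of the fitted normal curve, in inches

mean_deviation = sigma * math.sqrt(2 / math.pi)
quartile_deviation = sigma * NormalDist().inv_cdf(0.75)

print(round(mean_deviation, 2), round(quartile_deviation, 2))
```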

8.14. Calculated mean and quartile deviations (years), 6.37 and 5.38 (observed
5.44 and 4.03).

8.15. 18.


8.16. σ = 2.267 (uncorrected).

Theoretical frequencies, 2, 5, 11, 20, 29, 35, 35, etc.

8.17. Theoretical frequencies, 336.5, 397.1, 234.6, 92.5, 27.3, 6.5, 1.3, 0.2.

8.18. <c,«l-362. #c,«*l-766, k^^Z SIO. 


CHAPTER 9 

9.1. σx = 1.414, σy = 2.280, r = 0.81.

X = 0.5Y + 0.5, Y = 1.3X + 1.1.
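
The slopes of the two regression lines in 9.1 follow from the correlation and the standard deviations: the regression of Y on X has slope rσy/σx and that of X on Y has slope rσx/σy. The sketch below recomputes both from the quoted figures; the function name is illustrative.

```python
def regression_coefficients(r, sigma_x, sigma_y):
    """Return (slope of X on Y, slope of Y on X)."""
    b_xy = r * sigma_x / sigma_y
    b_yx = r * sigma_y / sigma_x
    return b_xy, b_yx


b_xy, b_yx = regression_coefficients(0.81, 1.414, 2.280)
print(round(b_xy, 1), round(b_yx, 1))
# The product of the two slopes equals r squared.
```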

9.2. r (between X and y) = — 0-66; between Y and Z«50-60; between Z and 
a:«*-o-i3. 

9.4. +0.96.

9.5. (1) +0.41, (2) +0.40.


CHAPTER 10 

10.3. From equations (10. 11) and (10.12) replace and by Sj and Sf in equation 
(10.10). Regarding this as an equation for r, note that r* is a maximum when tan 
20 is infinite, or 0a»45^ 

10.4. In fig. 10.1 suppose every horizontal array to be given a slide to the right
until its mean lies on the vertical axis through the mean of the whole distribution:
then suppose the ellipses to be squeezed in the direction of this vertical until
they become circles. The original quadrant has now become a sector with an angle
between one and two right angles, and the question is solved on determining its
magnitude.

10.5. The ellipse is a horizontal section of the surface. Its equation is 

and the standard deviations of sections are the square roots of the lengths of radii 
vectors of the ellipse. 

10.6. The maximum and minimum s.d.'s are given by the principal axes, which
leads to equations (10.11) and (10.12).

For an intermediate value there are two radii vectors and hence two sections. 

10.8. a and h must be negative, and 

b _ , , a 


2rxy y* 

■ 0-7 ' 


* 06 -A*’ 




ab-h* 


A 

V ab 


10.9. The sum of the ^th powers of the first n natural numbers is plus 

terms of lower order in n. 

10.10. Use equation (9.11). 


CHAPTER 11 

11.1. 7„=0-242, i;,.-.0-266. 

11.2. ^i^~0-82, 7 ,x=0-80. 

11.3. ρ = +0.79.

11.4. If the judges be denoted by 1, 2, 3,

ρ₁₂ = −0.21, ρ₂₃ = −0.30, ρ₁₃ = +0.64.

This suggests that judges 1 and 3 have tastes in common, but neither has much
in common with judge 2.

11.5. 0=2/3. 

11.6. 0=0-77. 

11.10. r = +0.83.

11.11. r = +0.22, 11,868 entries.


CHAPTER 12 


12.1. r₁₂.₃ = +0.759, r₁₃.₂ = +0.097, r₂₃.₁ = −0.436.
σ₁.₂₃ = 2.64, σ₂.₁₃ = 0.594, σ₃.₁₂ = 70.1.

X₁ = 9.31 + 3.37X₂ + 0.00364X₃.
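
The coefficients of the regression equation in 12.1 can be checked from the partial correlations and standard deviations through b₁₂.₃ = r₁₂.₃σ₁.₂₃/σ₂.₁₃ and b₁₃.₂ = r₁₃.₂σ₁.₂₃/σ₃.₁₂. The sketch below does so with the quoted figures; the helper name is illustrative and the last digit can differ because the inputs are rounded.

```python
def partial_regression(r_partial, sigma_dependent, sigma_independent):
    """Partial regression coefficient from a partial correlation and two partial s.d.'s."""
    return r_partial * sigma_dependent / sigma_independent


b12_3 = partial_regression(0.759, 2.64, 0.594)   # coefficient of X2
b13_2 = partial_regression(0.097, 2.64, 70.1)    # coefficient of X3

print(round(b12_3, 2), round(b13_2, 5))
```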

12.2. R₁(₂₃) = 0.80, R₂(₁₃) = 0.84, R₃(₁₂) = 0.57.

12-3. i'„.„-+0-68a. r„.„=+0-803. f„.„= +0-397. 

—0-433, •’,4.1,= — 0-553, rM.!,*" 0-149. 

»i.ti4“9-17, <r,.i,4'“12-5, o’ 4 .in— 105-4. 

X,=53+0- 127Jf,+0-587Af,+0-0345X4. 


12.4. *,<*>-0-87, *,<«4)“0-89. 

12.8. (J!r,-19-9)-4-51(X,-49-2)-0-88(Jf,- 

0 03. 
r„. 4 — +0-25. 
n..M-+0-23. 

1*»<im»).*“0*77. 


-30-2) 

-0 - 07a(jf 4-481 4) + 0 - e3(.ar,-4 1 -6) 


w* 
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12,7* Number of onlor 

Total number i«.i} 

This includes coefficients of type Rn 0 and counts as different from 

12.8. The correlation of the pth order is r/(1 + pr). Hence if r be negative, the
correlation of order n − 2 cannot be numerically greater than unity and r cannot exceed
(numerically) 1/(n − 1).
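
The result in 12.8 can be verified numerically. When every pair of variables has the same correlation r, each step of the usual recursion for partial correlations turns ρ into (ρ − ρ²)/(1 − ρ²), and repeating this p times should give r/(1 + pr). The function below is an illustrative sketch of that recursion.

```python
def partial_equal(r, order):
    """Partial correlation of the given order when all total correlations equal r."""
    rho = r
    for _ in range(order):
        # r12.3 = (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2)) with equal r's
        rho = (rho - rho * rho) / (1.0 - rho * rho)
    return rho


print(partial_equal(0.3, 3), 0.3 / (1 + 3 * 0.3))

# With n = 6 variables the limiting negative value is r = -1/(n - 1) = -0.2,
# for which the correlation of order n - 2 = 4 reaches -1.
print(partial_equal(-0.2, 4))
```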

12.9. — 1, 

12 . 10 . 


CHAPTER 13 

13.1. In Table 9.5 the unit, being a weekly figure, is not modifiable to the extent
that it relates to the situation at a given point of time. The choice of different intervals
between the points (e.g. months) might, perhaps, give a somewhat different picture.

In Table 9.6 the unit is a registration district and is modifiable by the amalgamation 
of districts. 

13.2. For this series 0.87. This is to be regarded as a nonsense-correlation,
although a very profound analysis might suggest that the falling infantile mortality 
was due to technical progress which also made increases in population possible. 

13.3. During the period steam vessels were replaced by diesel oil burners to some 
extent and horse-drawn vehicles by oil-propelled vehicles. From this point of view 
the correlation is hardly nonsense, though the relationship is very remote. 


CHAPTER 14 

14.1. Estimated true standard deviation 6.91; standard deviation of fluctuations
of sampling 9.38. (The latter, which can be independently calculated, is too low,
and the former consequently probably too high. Cf. 17.30.)

14.2. 0.43.

14.3. 58 per cent.

H.4. «r,VV(«r,*+«r,*)(«r,*+«r,*) 

14.6. 0.29.

The others may be written down from symmetry. 

14.8. (1) No effect at all. (2) If the mean value of the errors in variables is d, and
in the weights e, the value found for the weighted mean is:

The true value 

If r is small, d is the important term, and hence errors in the quantities are usually
of more importance than errors in the weights. If r become considerable, errors in
the weights may be of consequence, but it does not seem probable that the second
term would become the most important in practical cases.

14.9. r = +0.036.

14.10. $\dfrac{\operatorname{var} B}{\sqrt{(\operatorname{var} A+\operatorname{var} B)(\operatorname{var} B+\operatorname{var} C)}}$


CHAPTER 15

15.1. Line: y = 2.58 + 1.13(X-2) 

Quadratic : y = 1.48 + 1.13(X-2) + 0.55(X-2)² 

Cubic : y = 1.48 + 0.025(X-2) + 0.55(X-2)² + 0.325(X-2)³ 

Sums of squares of residuals : 5.819, 1.584, 0.063. 
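
The three residual sums of squares in 15.1 can be cross-checked against the quoted coefficients. The sketch below assumes that the five X values are 0, 1, 2, 3, 4 (this is an assumption; the data are not reproduced in the answer). With equally spaced points, each added term of higher degree lowers the residual sum of squares by the square of its leading coefficient times the sum of squares of the corresponding orthogonal polynomial. The names `phi2` and `phi3` are illustrative only.

```python
# Assumption: the five X values are 0, 1, 2, 3, 4 (not stated in the answer).
xs = [0, 1, 2, 3, 4]
us = [x - 2 for x in xs]                 # deviations from the centre X = 2

# Orthogonal quadratic and cubic terms for five equally spaced points.
phi2 = [u**2 - 2 for u in us]            # (X-2)^2 less its mean, which is 2
phi3 = [u**3 - 3.4 * u for u in us]      # (X-2)^3 less 3.4 (X-2)

ss_line = 5.819                          # residual sum of squares quoted for the line
b2, b3 = 0.55, 0.325                     # leading coefficients of the quadratic and the cubic

# Each extra term removes (coefficient squared) x (sum of squares of its polynomial).
ss_quadratic = ss_line - b2**2 * sum(p * p for p in phi2)
ss_cubic = ss_quadratic - b3**2 * sum(p * p for p in phi3)

print(round(ss_quadratic, 3), round(ss_cubic, 3))
```

Under that assumption the two computed values round to 1.584 and 0.063, the second and third sums of squares quoted above.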

15.2. If y is the average number of children for the duration X to X-1-1 years — 


Line: y=3-814+0-887(^-3) 

Quadratic: y=4-351-t-0-887^'|-3'j-0- 134(^'|-3y 

Cubic: y=.4-351-f0-912(^'|-3^-0-134(^^-3y-0 00;i61^g-3y 

For X^n the three values are 4 • 17, 4 • 68, 4 • 69. 

15.3. y = 1.42 

15.4. Gross output per 100 labour, y** gross output, 
y = 48.33 + 0.2375X - 0.00005546X² 


CHAPTER 17 

17.1. Theo. σ = 1.732 : Actual M = 6.116, σ = 1.732. 

17.2. (a) Theo. M = 2.5, σ = 1.118 : Actual M = 2.48, σ = 1.14. 

(b) Theo. M = 3, σ = 1.225 : Actual M = 2.97, σ = 1.26. 

(c) Theo. M = 3.5, σ = 1.323 : Actual M = 3.47, σ = 1.40. 
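
The theoretical values in 17.2 are those of a binomial distribution, with mean np and standard deviation √(npq). The sketch below assumes n = 5, 6 and 7 trials with p = 1/2 for the three parts; those values are not stated in the answer and are chosen only because they reproduce the quoted figures. The function name is hypothetical.

```python
from math import sqrt


def binomial_mean_sd(n, p):
    """Return the mean np and standard deviation sqrt(npq) of the number of successes."""
    q = 1.0 - p
    return n * p, sqrt(n * p * q)


# Assumption: parts (a), (b), (c) use n = 5, 6, 7 trials with p = 1/2.
theoretical = {label: binomial_mean_sd(n, 0.5)
               for label, n in (("a", 5), ("b", 6), ("c", 7))}

for label, (mean, sd) in theoretical.items():
    print(label, mean, round(sd, 3))
```

The standard deviations come out as 1.118, 1.225 and 1.323, matching the theoretical column of 17.2.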

17.3. The standard deviation of the proportion is 0.00179, and the actual divergence 
is 5.4 times this, and therefore almost certainly significant. 

17.4. The standard deviation of the number drawn is 32, and the actual difference 
from expectation 18. There is no significance. 

17.5. Difference from expectation 7.5 ; standard error 10.0. The difference might 
therefore occur frequently as a fluctuation of sampling. 

17.6. Standard error of proportion of bad eggs = 1.6536 per cent. A range of three 
times this gives range of 7.5 per cent to 17.5 per cent approximately. 

17.7. The test can be applied either by the formulae of Case 2 (17.28) or those of 
Case 3 (17.29). Case 2 is taken as the simplest. 

(AB)/(B) = 70.1 per cent.; per cent. 

Difference 5.8 per cent. (A)/N = 67.6 per cent and thence ε₁₂ = 3.40 per cent. The 
actual difference is 1.7 times this and might, rather infrequently, occur as a fluctuation 
of sampling. 

17.9. Difference of proportions*® Cj,** 0*033. Difference significant. Similar 
conclusions follow if the formulae of Case 3 (17.29) are applied. 

17.10. Proportion = 36 per cent. Limits 32.4-39.6 per cent. The sampling is 
almost certainly not simple. Possible causes are : (a) nature of subject-matter might 
require words of certain type, e.g. scientific words probably would not be Anglo-Saxon ; 
(b) the occurrence of one word influences the occurrence of the next. 

17.11. If there are f₁ samples of n₁ individuals each, f₂ of n₂, etc., 

17.12. Standard error of expected proportion = 23.05 per cent. 

Standard deviation of actual distribution » 23 *u9 per cent. 

17.13. Standard deviation of simple sampling 23.0 per cent. The actual standard 
deviation does not, therefore, seem to indicate any real variation, but only fluctuations 
of sampling. 

17.14. σ² = npq as if the chance of success were p in all cases (but the mean is n/2, 
not pn). 

17.17. Mean number of deaths per annum 680. 

σ² = 566,582 ; r = 0.000029. 


CHAPTER 18 


18.1. P = 0.1773. 

18.2. P = 0.9595. 

18.3. Median : Estimated frequency = 1554. Standard error 0.28 lb. 

Lower Q : frequency 1472. Standard error 0.26 lb. 

Upper Q : frequency 1116. Standard error 0.34 lb. 

18.4. 0.18 lb. 

18.5. 0.24 lb., 14 per cent less than the s.e. of the median. 

18.6. Estimated frequencies : Q₁ = 67,548, Mi = 63,152, Q₃ = 30,488. 

Standard errors (years) 0.011, 0.013, 0.023. 

18.7. Standard error of mean = 0.015 years. 

18.8. Standard error of quartiles 0.020 years. 

18.9. 1-34270. 

V» 

18.10. ε₁₂ = 1.36 shillings. Difference of means 2 shillings. Difference hardly 
suggestive of real effect. 

18.12. Yes, one might, because the results on farms in successive years are correlated. 

18.13. Mean = 5.613 ; s.e. of mean 0.10. 

Median = 8.128 ; s.e. of median 0.21. 

18.14. P = 0.309. 

18.15. £450,000 ; £1,350,000. 

18.16. 0.12 inch. 


CHAPTER 19 


19.1. Standard error = 0.223 lb. 

On basis of normal distribution = 0.170 lb. 

19.2. 0.011, 0.014. 

19.3. S.e. of s.d. = 0.707σ/√n 

S.e. of Q = 0.787σ/√n 

19.4. Difference of s.d.'s 0.2. On the assumption of normality ε₁₂ = 0.088. Difference 
might therefore arise, rather infrequently, as sampling fluctuation. 

19.5. r = -0.008 for height distribution, r = +0.71 for marriage distribution. 

19.6. var 

* w 

var A|« for normal curve. 

W H 

var A4-i{36/t,*(fi4-;t,‘) + 

+ 12/»,{/t4-/t,;t4-4/»,») } 

24(r* 

for normal curve. 

19.7. For the 6th and lower moments. 

19.9. Standard errors are 0.0176, 0.0158, 0.0263, and results might all have arisen 
from an uncorrelated population ; if the population were actually uncorrelated, the 
standard errors would be the same to the number of places given, owing to the 
smallness of r. 

19.10. Standard errors 0.0758, 0.1308, 0.0850, and the correlations are all significant. 


CHAPTER 20 

20,1. x**5-811. v^l, P«0-56, 

20.3. χ² = 4.3, ν = 9, P = 0.89. The hypothesis seems reasonable. 

20.5. χ² = 27.94, ν = 4, P = 0.000012. The association is significant. 

20.6. χ² = 0.7080, ν = 1, P = 0.400. The divergences from expectation may well 
have arisen by sampling fluctuations. 

20.7. Use the result that for large n, χ² is distributed approximately normally. 

20.8. χ² = 27.68, ν = 4, P = 0.00001. The data are very suggestive of association. 

20.11. χ² = 13.15, ν = 2, P = 0.0014. This is rather low and we suspect the sampling 
to be non-random. 
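
For an even number of degrees of freedom the probability P has a closed form: with m = χ²/2, P equals e^(-m) times the sum of m^k/k! for k from 0 to ν/2 - 1. The sketch below applies this standard result to the figures of 20.11; it is offered only as a check on the quoted probability, not as the method by which the answers were computed, and the function name is hypothetical.

```python
from math import exp, factorial


def chi2_tail_even(chi2, nu):
    """P that chi-square with nu degrees of freedom exceeds chi2 (even nu only)."""
    if nu <= 0 or nu % 2:
        raise ValueError("this closed form needs a positive even nu")
    half = chi2 / 2.0
    # exp(-m) * (1 + m + m^2/2! + ... + m^(nu/2 - 1)/(nu/2 - 1)!)
    return exp(-half) * sum(half**k / factorial(k) for k in range(nu // 2))


p_20_11 = chi2_tail_even(13.15, 2)   # figures of Exercise 20.11
print(round(p_20_11, 4))
```

For χ² = 13.15 and ν = 2 the result rounds to 0.0014, the probability quoted in 20.11.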

20.12. x*«*0-993, v«3, P»0*018. Not a very good fit. (In this Exercise the 
last four frequencies have been grouped together and ν reduced by unity to allow for 
the estimation of the mean of the Poisson distribution.) 

20.14. x*‘*^l’4700, ve=3, P«0-943 (by direct calculation). 

20.16. If the total number of births is spread over the period evenly (on the basis 
of number of days in the various months) the theoretical frequencies are 50,349 for 
a month of 31 days, 48,724 for a month of 30 days and 45,476 for February. χ² = 
333.9 and deviations cannot be due to chance. 


CHAPTER 21 

21.1. t = -0.664, ν = 9, P = 0.738. 

The probability that we should get a value of t greater in absolute value is 0.524. 

21.2. The diflewnces in the returns, including cost of manure, have mean«l„ 

1*375, 1-907, v«4. P«0-935. Assuming that distribution of diflerences 

is normal, a greater value would arise about 65 times in 1,000. There is some ground 
for supposing that the increased returns on the better manured plot are real, and that 
it would therefore pay to continue the more expensive dressing. 

21.3. Applying the t test for two samples, 

t = 0.0991, ν = 14, P = 0.54 

There is nothing in this test to suggest that populations were unlike as regards height. 

21.4. z = 0.1761, ν₁ = 9, ν₂ = 5. The difference of standard deviations is not significant. 
Coupled with Exercise 21.3, we conclude that there is no ground for supposing the 
two populations different as regards height. 

21.5. Applying the t test for two samples, 

t = 2.683, ν = 4, P = 0.972 

The difference of means is likely to be significant, which supports the suggestion. 

21.6. log. |^-_0-549 <r-^-0-2887 

The observed deviation is suggestive, but not decisive. 

21.8. P = 0.0048. For the standard error formula P = 0.0000078. 

21.9. All significant. 


CHAPTER 22 


22.1. The analysis is 

                      Sum of squares    d.f.    Quotient 
Between batches            44,360         3      14,787 
Residual                  151,351        22       6,880 
Total                     195,711        25       7,828 

z = 0.383 which is not significant. 
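
The quotients in 22.1 are the sums of squares divided by their degrees of freedom, and the figure 0.383 agrees with half the natural logarithm of the ratio of the two quotients (Fisher's z). The sketch below reproduces that arithmetic from the tabulated figures; reading 0.383 as z is an interpretation of the printed answer, and the variable names are illustrative.

```python
from math import log

# Sums of squares and degrees of freedom as tabulated in 22.1.
rows = {
    "Between batches": (44360, 3),
    "Residual": (151351, 22),
}

# Quotient (mean square) = sum of squares / degrees of freedom.
mean_square = {name: ss / df for name, (ss, df) in rows.items()}

# Fisher's z: half the natural log of the ratio of the two mean squares.
z = 0.5 * log(mean_square["Between batches"] / mean_square["Residual"])

# The components should add up to the total line of the table.
total_ss = sum(ss for ss, _ in rows.values())
total_df = sum(df for _, df in rows.values())

print(round(mean_square["Between batches"]), round(mean_square["Residual"]), round(z, 3))
```

The mean squares round to 14,787 and 6,880 and z to 0.383; the sums of squares and degrees of freedom add to the totals 195,711 and 25.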

22.3. The analysis is 

                         Sum of squares    d.f.    Quotient 
Between consignments          9.71           5       1.94 
Between observers            13.13           3       4.38 
Residual                     13.12          15       0.87 
Total                        35.96          23 

Differences between observers are significant at the 5% level. 

22.5. All significant. 




22.6. Significantly non-linear. 




22.8. The analysis is 

                          Sum of squares    d.f. 
Between investigators           775           4 
Between areas                   239           4 
Residual                      1,175          16 
Total                         2,190          24 

Differences are not significant. 


CHAPTER 23 

23.6. (a) 0.0726, (b) 0.0553, (c) 0.0661, (d) 0.0482. 


CHAPTER 24 

24.1. 0.93877, 0.93823, 0.93822. 

24.2. 0.828832, 0.818050, 0.817939. The inclusion of the third difference affects 
only the fourth place by a single unit, so we can probably trust the answer to be 

24.3. Using logarithmic interpolation, the successive approximations are : 0.11200, 
0.10044, 0.09963. Second difference interpolation using the last three data only 
gives 0.09359. It looks as if we could trust the figure as about 0.100 or 0.099. 

24.4. 4195, 4443, 4724, 5036, 5380. 

24.7. 11.388 approximately. 

24.8. Median 4.8924, 4.8869. First decile 1.9474, 1.9572. Ninth decile 8.4286, 
8.3733. As we would probably state such figures only to two decimal places, the 
median would not be appreciably affected by taking second differences into account, 
but the deciles would be slightly corrected. 

24.9. Maximum at 1.336, or day 40, 25th July, value 63.7. 

Minimum at 1.184, or day 35.5, 20th-21st January, value 38.0. 

These estimates are very poor. The maximum is actually 63.4 on 15th-17th July, 
and the minimum 37.9 on 8th-12th January. 


CHAPTER 25 


25.1. Index numbers are 

1923    100        1927    [illegible]        1931     87 
   4    101           8     90                   2     81 
   5    101           9     98                   3     79 
   6    100        1930     98                   4     77 
                                                 5     81 

25.3. Index numbers are 

          (1)      (2) 
1930      100      100 
   1       81      102 
   2       75       90 
   3       71       91 
   4       74       95 
   5       75       97 
   6       79      103 


25.4. To nearest unit, index is 102 in all cases. 



25.6. Index numbers are 

          (1)      (2) 
1935      100      100 
   6      101      100 
   7      109      110 
   8      103      105 
   9      106      107 
1940      134      131 
   1      134      131 
   2      141      138 
   3      146      144 



CHAPTER 26 

26.1. The figures are given in Table 27.1, page 640. 

26.3. To the nearest unit the first average gives: 

73 (1924), 72, 71, 71, 68, 67, 66, 66, 63, 62, 61, 59, 57, 56, 56, 56, 56, 56, 
54, 52, 49 (1944). 

A second average of these figures gives: 

71 (1928), 70, 69, 68, 66, 65, 64, 62, 60, 59, 58, 57, 56, 56, 55, 55, 53 (194 ). 
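
The relation between the two rows of 26.3 can be illustrated with a simple moving average. The sketch below assumes a span of five terms (the span is not stated in the answer and is inferred only from the lengths of the two rows), reads the nineteenth figure of the first row as 54 and the fourth to sixth figures of the second row as 68, 66, 65 (assumed readings of doubtful print), and compares the computed averages with the printed second row. The function name is hypothetical.

```python
# First average as printed in 26.3 (nineteenth figure read as 54; an assumption).
first = [73, 72, 71, 71, 68, 67, 66, 66, 63, 62, 61, 59, 57,
         56, 56, 56, 56, 56, 54, 52, 49]

# Second average as printed (fourth to sixth figures read as 68, 66, 65).
printed_second = [71, 70, 69, 68, 66, 65, 64, 62, 60, 59, 58, 57,
                  56, 56, 55, 55, 53]


def moving_average(values, span):
    """Simple unweighted moving average over `span` consecutive terms."""
    return [sum(values[i:i + span]) / span
            for i in range(len(values) - span + 1)]


second = moving_average(first, 5)   # span of 5 is an assumption

# Largest difference between the computed and the printed second average.
largest_gap = max(abs(a - b) for a, b in zip(second, printed_second))
print(len(second), largest_gap < 1)
```

Under these assumptions the computed series has the same seventeen terms as the printed one and never differs from it by as much as one unit; small differences would be expected if the printed row was averaged from unrounded figures.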

26.4. Expressed as a percentage of the average monthly rainfall the figures are: 

           1943    1944    1945    1946 
Jan.        207     110     114     114 
Feb.         70      66     121     132 
Mar.         34      19      49      56 
April        66     104      80      85 
May         139      65     139     130 
June         94     102     131     139 
July         77      98      91     108 
Aug.         96      96      84     161 
Sept.       134     165      98     193 
Oct.         86     113     103      38 
Nov.         77     175      23     178 
Dec.         54      71     105     102 

26.5. r₁ = +0.735, r₂ = +0.367, r₃ = +0.054, r₄ = -0.102, r₅ = -0.082, 
r₆ = +0.027. 

26.7. The weights of the process 

are 





n [1. 3. 

6. 9, 

12. 13. 



26.8. The index-numbers are: 

1926    126        1936    108 
   7    102           7    119 
   8     92           8     92 
   9    103           9     90 
  30     99          40     99 
   1     87           1     91 
   2     80           2     85 
   3     83           3     83 
   4     75           4     81 
   5     94           5     87 



26.9. As a preliminary show that for a cubic curve (third differences constant) 

[A] «< ‘ tt* + -“24“ {^+1 ““ + ut^\) 


26.10. The index-numbers are: 

                 Quarter 
           1      2      3      4 
1928                   120    118 
   9     117    115    112    109 
  30     104     99     94     89 
   1      86     84     83     83 
   2      83     82     80     79 
   3      79     79     80     81 
   4      81     82     82     83 
   5      83     64 




26.12. See the hint on Exercise 26.9. 


CHAPTER 27 

27.1. The number of turning points is 31, almost exactly the expected number 30.67. 
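
For a random series of n terms the expected number of turning points is 2(n - 2)/3, with variance (16n - 29)/90. The sketch below evaluates both for n = 48; that length is an assumption, chosen only because it reproduces the expected number 30.67 quoted in 27.1, and the function names are hypothetical.

```python
from math import sqrt


def expected_turning_points(n):
    """Expected number of turning points in a random series of n terms."""
    return 2 * (n - 2) / 3


def turning_point_variance(n):
    """Variance of the number of turning points in a random series of n terms."""
    return (16 * n - 29) / 90


n_terms = 48   # assumed series length; it gives the expectation quoted in 27.1
expected = expected_turning_points(n_terms)
std_dev = sqrt(turning_point_variance(n_terms))

print(round(expected, 2), round(std_dev, 2))
```

With n = 48 the expectation is 30.67, so an observed count of 31 lies well within one standard deviation of it.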

27.2. When the mean-distance is 3, the known result for random series. 

27.3. The mean distances are 7.28, 4.96 and 4.96 and the autoregressive periods 
are 10.90, 7.24 and 5.68, respectively. 





27.5. a = -1.206, b = +0.420. 

27.10. The autocorrelations are as follows: 

  k      r_k         k      r_k 
  1     0.957       11    -0.053 
  2     0.836       12    -0.030 
  3     0.660       13    -0.012 
  4     0.461       14    -0.002 
  5     0.269       15     0.003 
  6     0.111       16     0.003 
  7     0.000       17     0.002 
  8    -0.061       18     0.001 
  9    -0.082       19     0.000 
 10    -0.074       20     0.000 




INDEX 


[The references are to pages. References to Greek letters follow those for Roman letters.] 


Absolute measures of dispersion, 144 
Accident, death from, 193 
Achenwall, G., footnote, xvii 
Additive property of χ², 473 
Ages, at death from scarlet fever (Table 
4.11), 89 ; (Fig. 4.11), 90 
— , at death from all causes (Table 4.16), 
97 ; (Fig. 4.17), 97 
— , of cows correlated with milk-yield, see 
Milk-yield 
— , of husband and wife (Table 9.2), 201 ; 
constants, 229 ; correlation ratios 
(Exercise 11.2), 279 
Agricultural labourers' earnings, see 
Earnings 
— — , minimum wage-rates, 128 ; means 
and s.d., 128-130 ; median and m.d., 
139 ; quartiles, 141 
Agricultural Market Report, data cited 
from, (Table 9.7), 207 
Agricultural Price Index, (Example 25.3), 
603 
Agricultural Statistics, data from, (Table 
13.1), 311 ; (Table 23.1), 545 
Ammon, O., data cited from, (Table 3.2), 
50 

Analysis of variance, 503-529 ; for a 
single classification, 503 ; 
relationship with inter-class correlation, 
512 ; for two-fold classification, 513 ; 
significance of correlation ratio, 517 ; of 
linearity of regression, 519 ; of multiple 
correlation, 521 ; unequal numbers in 
classes, 512 ; three-fold classification, 
523 ; of family budgets (Example 23.7), 
546 
Animal feeding stuffs, index numbers of 
prices of (Table 9.7), 207 ; (Figure 9.4), 
211 ; correlation, 223-5 
Annual values of estates in 1715, (Table 
4.12), 94 ; (Fig. 4.13), 92 
Arithmetic mean, see Mean, arithmetic 
Array, definition, 199 ; type of, 199 ; s.d. 
of, 221 ; homo- and hetero-scedasticity 
of, footnote 221 ; in normal correlation, 
237, 241, 305 
Association, generally, 19-48 ; definition, 
22 ; testing for, 24-28 ; coefficient 
of, 30 ; partial, 31-37 ; illusory, 37-8 ; in 
incomplete data, 38-40 ; complete 
independence, 40-1 
Asymmetrical frequency-distributions, 83- ; 
relative position of mean, median 
and mode in, 117. See also Skewness 
Attenuation, in correlation, 313 
Attributes, generally, 1-18 ; class-fre- 
quencies, 3-6 ; positive, 7-9 ; consist- 
ence, 9-11 ; incomplete data, 11-14 
Australian marriages, distribution of 
(Table 4.8) 84 ; (Fig. 4.8) 85 ; mean 
and s.d. 132 ; third and fourth 
moments 157 ; β₁ and β₂ 160 ; median 
and quartiles 163 ; skewness 163 ; 
kurtosis 164 ; standard error of mean, 
median and quartiles (Exercises 18.6 
and 18.7) 435 ; standard error of s.d., 
444 ; correlation between errors in 
mean and s.d. (Exercise 19.5), 457 
Auto-correlation, see Serial correlation 
Autoregressive series, 645-658 ; estima- 
tion of constants, 649 ; properties of, 
655-8. See also Correlogram, Serial 
correlation 
Averages, generally, 102-124 ; desirable 
properties of, 103-4 ; forms of, 104 ; 
average in sense of arithmetic mean, 
105. See also Mean, Median, Mode 
Axes, principal, in correlation, 242, 323, 
362-3 


Babington Smith, B., factor analysis, 
323 ; random sampling numbers, 376 
Barlow's Tables of Squares etc., 56 
Barometer heights, (Table 4.10) 88 ; (Fig. 
4.10) 88 ; means, medians and modes in, 
117 ; modes of, 583 
Barley, prices of, (Table 25.2) 593 
Base-year, in index numbers, 591-2 
Bateson, W., data cited from, 29 
Beetles (Chrysomelidae), sizes of genera 
(Table 4.13), 95 
Bernoulli, James, Binomial distribution, 
169 
Bertrand, J. L. F., quotation on chance, 
374 
"Best fit," see Least Squares 

Beta-function, 494 
Beveridge, Lord, 645 
Bias in sampling, 371-4 
— , in estimation, 544-547, 550-3 ; tech- 
nical definition, 547-9 ; cumulative 
effect of, 549 
— , in scale reading, 74 
Biehl, K., data cited from, 315 
Bielfeld, Baron, J. F. von, use of word 
"statistics," xvi 

Binomial distribution, 169-195 ; genesis 
of, 169-171 ; form of, 172-174 ; con- 
stants of, 174-6 ; mechanical representa- 
tion of, 176 ; limiting form, 177-181 ; 
Poisson distribution, 139-191 ; in 
sampling of attributes, 386-394, see 
Sampling of Attributes 
Birth-rate, in local government areas, 70 ; 
correlation with number of births, 206 ; 
constants of distribution (Exercise 9.3), 
234 ; standardisation of, 333-7 
— of cattle, (Table 26.3), 614 
Bivariate distribution, 201 ; normal sur- 
face, 237-250 ; see also Correlation 
Blackman, V. H., data on duckweed, 350 
Bortkiewicz, L. von, Poisson distribution, 
193 
Breaking-up a group, in interpolation, 
571-3 
British Association, data cited from, 
(stature, Table 4.7) 82 ; (weight, Table 
to Exercise 4.6), 100 


Cambridgeshire, mortality in, 561 
Cards, punched, for recording of data, 62 ; 
for sampling, 375 
Carroll, Lewis (pseudonym), (Exercise 1.9) 
16 
Cells, in χ², 459 

Census data, see Registrar- General 
Centred averages. 624 
Chance, see Randomness, Probability 
Charlier, C. V. L., in sampling theory, 407 
Chi-square, chi-squared, see χ² 
Cholera and inoculation, 25, 27, 467, 473 
Chrysomelidae, see Beetles 
Circular test, in index-numbers, 602 
Clark, K. D., data from, 194 
Class, in theory of attributes, 2-4 ; class- 
frequency, 3 ; ultimate classes, 5-7 ; 
positive and negative classes, 3 
Class-interval, definition, 70 ; choice of 
magnitude and position, 72-4 ; see also 
Sheppard's Corrections 
Classification, generally, 1-2 ; by dicho- 
tomy, 2-3 ; manifold, 49 ; homogeneous 
59-61 ; as series of dichotomies, 61 ; by 
punched cards, 62 
Closeness of fit, see χ² 
Cloudiness at Greenwich, (Fig. 4.15), 93, 
(Table 4.14) 96 
Coefficients of association etc., see under 
Association etc. 
Complex frequency-distributions, 92 
Confluence analysis, 323 
Consistence, of class-frequencies, 9-11 ; of 
correlation coefficients, 301-2 
Constraints, in Lexis' sense, 408 ; in χ², 
461 

Contingency, coefficient of (Pearson's) 53, 
(Tschuprow's) 54 ; isotropy in, 57-9 ; 
relation with normal correlation, 250 ; 
standard error of, 454 
— , tables, definition, 50 ; association in, 
51-2 ; isotropy in, 57-9, 248 ; indepen- 
dence in, 59 : degrees of freedom in, 

462 ; tests of divergence from indepen- 
dence, 467-8 

Corrections, for grouping, see Sheppard's 
Corrections 
— , of correlations for errors of observa- 
tion, 328 ; of death-rates, 335-7 
Correlation, generally, 199-339 ; con- 
struction of tables, 199-201 ; representa- 
tion of tables by diagrammatic methods, 
203-212 ; treatment as contingency, 
212 ; for illustrations see Frequency- 
distributions, Illustrations 

Product-moment coefficient, defini- 
tion, 218 ; lines of regression 214-8 ; 
calculation of, 222-230 ; corrections for 
grouping, 231 ; estimation of, 253-5 ; 
modifiable unit, 310-2 ; attenuation of, 
313-5 ; nonsense correlations, 315-7 ; 
errors of observation in, 328 ; between 
indices, 330-1 ; heterogeneity of 
material, 331 ; standard error of, 451-2; 
significance in small samples, 495-9 
Rank-correlation, 258-268, see Rank- 
correlation ; grade-correlation, 268-9 ; 
tetrachoric correlation, 270-1 ; intra- 
class correlation, 272-7 
— , normal, 237-252 ; linearity of regres- 
sion in, 240-1 ; homoscedasticity in, 
241 ; isotropy in, 248-250 ; relation 
with contingency, 250 ; multivariate, 
303-6. 

— , partial. 281-306 ; generalised regres- 
sions, 282 ; notation, 284 ; expression 
in terms of lower order coefficients, 290 ; 
calculation, of, 290-7 ; expression in 
terms of higher order coefficients, 300-1; 
fallacies in interpretation, 302-3 ; test 
of significance, 451, 495-9 
— , multiple, 281, 361 ; coefficient of, 298- 
300 ; significance of, 453, 521-2 
— , ratios, 256-8 ; relation with goodness 
of fit, ; significance of, 453, 517-9 
— , serial, in time-series, 639 
Correlogram, definition, 651 ; of auto- 
regressive and harmonic series, 651-4 
Cosin, value of estates in 1715 (Table 4.12), 
94 
Cost of living index, 596 
— , of electricity, see Electricity 
Covariance, definition, 222 
Coutts, J. R. H., data cited from (Table 
15.5), 356 
Cows, distribution according to milk- 
yield, see Milk-yield 
Criminals, weights and mentality, (Table 
3.6), 64 
Crop forecasting, pessimism in, 544-5 
Crops and weather, correlation of, 
320-1 
Cumulants, definition, 164-5 

Cumulative frequency (distribution) 

function, 144-6 

Curve fitting, generally, 340-363 ; least 
squares in, 342-3 ; equations for, 344 ; 
calculation, 346-8 ; reduction to linear 
form, 348 ; residuals, 360 ; closeness of 
fit, 361 

Curvilinear regression, see Regression 


Darbishire, A. D., data cited from, 121 ; 
(Exercise 17.12), 411 

Datura, association in, 29 ; (Exercise 
20.6), 479 

Davenport, C, B., data cited from, (Table 
9.1), 200 

David, census of Israelites, footnote, xiv 

David. F. N., on correlation coefficient, 
495, 496 

Deaf-mutism, association with imbecility, 
(Exercises 2.1 and 2,15), 43, 46 ; fre- 
quency among offspring of deaf-mutes. 
(Exercise 4.5 (6)), 99 

Deaths or death-rates, association with 
occupation, 39 ; from scarlet fever 
(Table 4.11, Figure 4.11), 89. 90; 
infantile and general mortality, 317-9 ; 
standardisation of. 39, 335-7 ; from 
accidents, 396 ; from explosions in 
mines 406 ; in non-simple sampling, 
^6, 404, 406 ; mortality in Cambridge- 
shire, 561 

Deciles, definition, 144 ; standard error of, 
423 

Defects in schoolchildren, 5-6, 33-5 

Degrees of freedom, in χ², 461-2 ; in 
analysis of variance, 484-5, 505 ; in 
t-test, 487 
Demoivre, A., discoverer of normal distri- 
bution, 169 
Dependent variable, in regression and 
curve fitting, 282, 345 
Design of statistical inquiries, 370, 530- 
; see Sampling 
Deviance, definition, 504. See Analysis 
of variance 
Deviation, mean, 137-140 ; corrections for 
grouping (Exercise 6.11) 149 ; least 
about median, 138 ; comparison with 
standard deviation, 140 ; of normal 
distribution, 184 
— , quartile, see Quartile 
— , root-mean-square, see Deviation, 
Standard 
Deviation, Standard ; definition, 126 ; re- 
lation with root-mean-square deviation, 
127-8 ; calculation of, 128-133 ; correc- 
tion for grouping, 133-4 ; properties of, 
134-137 ; of series of natural numbers, 
136 ; of rectangular distribution, 1^ ; 
of arrays in correlation, 221, 240-1, 
243 ; generalised, 284 ; of sum or 
difference, 326-7 ; influence of errors of 
observation on, 327 ; of an index, 329 ; 
of binomial distribution, 175 ; of 
Poisson distribution, 191. See also 
Error, Standard 

Dice, records of throws, (Table 4.15 and 
Figure 4.16), 96; (Exercise 8.2), 197; 
divergence from expectation, 387-8, 389, 
(Exercise 17.1), 409, 466, 470-1, 474-6. 

Difference-method, in correlation and 
time-series, see Variate-difference. 

Differences, in interpolation, 556 and see 
Interpolation. 

Discounts and reserves in American banks 
(Table 9.5, Figure 9.2), 205. 209 

Discriminant functions, 606 

Dispersion, measures of, generally, 125- 
150 ; absolute measures of, 143 ; in 
Lexis’ sense, 407-8 ; see Deviation, 
Mean ; Deviation, Standard ; Range ; 
Quartiles 

Distance-velocity relation in nebulae, 
340-1, 347-8, 362 

Distribution curve, 144-6 ; of frequency, 
see Frequency-distribution. 

Duckweed, correlation in, 227-230; growth 
of, 348-350 

Durant, H. E., on sampling for reading, 
550 


Earnings of agricultural labourers, corre- 
lation in, (Exercise 9.2) 233 ; partial 
correlation, 290-3, (Figure 12.1) 297 
Economy in variables, 322 
Edgeworth, F. Y., data on dice-throwing, 
(Table 4.15), 96 
Efficient estimates, 475 
Egg-prices, index-numbers of, 625 
Electoral voting in English municipalities, 
(Table 17.1) 402 
Electricity Commission, data from returns 
of, (Table 15.4), 353 
Electricity, costs and numbers of units of, 
(Table 15.4), 353, 350-2 
Elimination of seasonal effects in time- 
series, 624-5 
Engledow, Sir Frank L., data from, (Table 
22.5), 509 

Error function, see Normal Distribution 
Error, mean, 137 
Error, mean-square, 137 
Error, probable, 1 17, 390 ; see Error, 
Standard 
Error, Standard, definition 390, 421 ; of 
number or proportion of successes, 387 ; 
when sample-numbers vary (Exercise 
17.11), 411 ; when chance of success is 
small, 393 ; of percentiles, quartiles 
etc., 423 ; of semi-interquartile range, 
427 ; of arithmetic mean, 428 ; of 
variance, 442 ; of standard deviation, 
442 ; of coefficient of variation, 448 ; of 
moments about a fixed point, 437-9 ; of 
moments about the mean, 440 ; of third 
and fourth moments about the mean, 
447 ; of β₁ and β₂, 450 ; of coefficients 
of correlation and regression, 451-3 ; 
approximate formulae for correlation 
ratio and multiple correlation, 453 ; of 
coefficient of association, 454 ; of 
mean-square contingency, 454 ; of 
Spearman's ρ, 454 ; of Kendall's τ, 455. 
See also Sampling, Theory of 
Error, Theory of, see Sampling, Theory of 
Estates, value of, see Value 
Estimates, precision of, 369 ; efficient, 
475 ; in small samples, 482 ; of arith- 
metic mean, 482-3 ; of variance, 484 ; 
degree of freedom of, 484-5 
Estimation, Theory of, ^9 ; of theoretical 
frequencies in χ², 474-5 ; of 
position of maximum, 582 ; of con- 
stants in autoregressive series, 649 
Examination of samples, 530, 544-552 
Existent populations, 367 
Explosions in coal-mines, deaths from, 406 
Eye-colour, association of, father and son, 
26-7, (Exercise 2.4), 44, (Table 3.4), 58-9 ; 
of husband and wife (Exercise 2.5), 44 ; 
with hair-colour (Table 3.2), 50 


Factor-analysis, 323 
Factor reversal test, in time-series, 601 
Fallacies, in interpreting associations, 37-8 ; 
due to change in classification, 60-1 ; in 
interpreting correlations, 302-3 ; spuri- 
ous correlation, 330-1 ; due to hetero- 
geneity, 331-2 ; nonsense-correlations, 
315 


Family budget data in Nagpur. 546-7 
Farm survey, estimation by sampling 
fraction. 537-8 

Fay, E. A., data from, (Exercise 4,5 (5) ), 


Fecundity of brood-mares, (Table 4.9, 
Figure 4.9), 86, 87 

Finite populations. 367 ; variance of 
proportion from, 405 

Fisher, Irving, 601 

Ft4^, K, A., Tables of ^65 ; limiting 
mmsiahty of l Tables of /, 488- 

; distribution of variance-ratio, see 
Fisher's distribution ; distribution of 
correlation coefficient, 495 ; trans- 
formation, 497 

Fisher's distribution (z-distribution) 493 ; 
in analysis of variance, 506 ; for large 
numbers of degrees of freedom, 512 ; see 
also Analysis of Variance 
Fit, goodness of, see χ² 
Fitting, of curves ; see Curve fitting 
Flying bombs, distribution of, 194 
Food, Drink and Tobacco trades, sizes of 
firms in, (Exercise 4.5 (a)), 99 
"Footrule," Spearman's, footnote, 262 
Forecasting of crop-yields, 544-5 
Fourier analysis see Harmonic analysis 
France, Anatole, xiv 

Freedom, degrees of, see Degrees of 
Freedom 

Frequency-curve, 80-1 ; ideal forms of, 81, 
84, 91, 93 : Pearson's, 194-5 
Frequency-distributions, generally, 69- 
101 ; magnitude and position of class- 
intervals, 72-4 ; graphical representa- 
tion, 78-81 ; common types of, 81-92 ; 
symmetrical, 82 ; skew, 83-7 ; J- 
shaped, 87-90 ; U-shaped, 90-1 ; 
truncated forms, 91-2 ; complex forms, 
92-4 ; pseudo-forms 94-8 ; reduction to 
absolute scale, 144 : cumulated sum 
(distribution curve) 144-146 ; theo- 
retical forms, 169-198. See Normal 
distribution, Binomial distribution, 
Poisson distribution. Correlation, 
Bivariate, Multivariate distribution. 
Frequency-distributions, illustrations : 
birth-rates in local government areas 
(Table 4.1), 70 ; capsules of poppies 
(Table 4.2), 71 ; lengths of screws 
(Table 4.3), 72 ; final digits in measure- 
ments (Table 4.4), 74 ; persons liable 
to surtax and super-tax (Table 4.5), 77 ; 
headbreadths of students (Table 4.6), 
78 ; statures of men (Table 4.7). 82 : 
marriages in Australia (Table 4.8), 84 ; 
fecundity in brood mares (Table 4.9), 
86 ; barometric heights (Table 4.10), 88; 
deaths from scarlet fever (Table 4.11), 
89 ; values of estates (Table 4.12), 94 ; 
beetles (Table 4.13), 95; cloudiness 
(Table 4.14), 96; dice-throws (Table 
4.15), 96 ; male deaths (Table 4.16), 
97 ; size of firms in Food, 
Drink and Tobacco Trades (Exercise 
4.5 (a)), 99 ; deaf-mutes (Exercise 4.5 
(b)), 99 ; yield of grain (Exercise 4.5 
(c)), 99 ; petals in buttercups (Exercise 
4.5 (d)), 100 ; weights of men (Exercise 
4.6), 100 ; deaths from horse-kick, 193 ; 
flying bombs, 194. 

Diameters in shell fish (Table 9.1), 
200 ; age of husband and wife (Table 
9.2), 201 ; statures of father and son 
(Table 9.3), 202 ; age and mllk*yidid of 
cows (Table 9.4), 204 ; diseemnt mtto 







t^h-rate 9M lUiimMi of oittoi (Tilutt 
9.6), 206 : fronds of teas (TsUo $,9), 
226 ; dectmats voting in niinacipalitet 

(Tabls 17.1J. 402. 

Frequency-polygons, 78-80 
Frequency-surface, ree Bivariate Distribu- 
tions 


Frisch, R., confluence analysis, 323 
Fundamental sets, specifying data, 6 


^rancis, ogive, 145; bino- 
n»i«l ajppwatus, 176; regr^on, 213; 


- , 

jgxercises 2.4 and 


data cited from, 26. 

2.5), 44, (Table 3.4). ^ 

^ 4 ^' distribution. 169 

oi term mean error,'’ 137 
Gehlke, C. E., data cited from, 315 
Geometric mean, see Mean, geometric 
Gini, C., coefficient of mean-difference, 

Goodness of fit, see χ² 
Gosset, W. S., see "Student." 

Grades, 144 ; grade correlation, 268-270 ; 
see Quartiles 

Graduation, 575-9. See Interpolation 
Graphical methods, of representing fre- 
qneniqr-distribution. 78-80 ; of inter- 
polating for quartiles, 113-4 ; of rep- 
J^ting cwelat^ (scatter diagram). 
411-2; of estimatmg correlation, 253-4 
Gray, J., data cited from, 398 

‘^^Sr8.3)^l-75*‘** 

toup, • 

571-5 


Group. breaJcing-up of, in interpolation 


Grouping-corrections see Sheppard's 
corrections 

Grouping of observations, in freqnency- 
di^bntions. 71-5; in correUtions. 

Growth of dndcweed, 348-350 


contingency, (Table 
6.3), 55 ; non-isotropy 
M, 56-7 ; in school-girls, 398, 400 
"■“-“yMtants. see Cnmnlants 

, e 7*f {Exercise 

4.5 (e)), 99 

* Ponp. in interpolation, 578-5 
Harman, H., Factor Analysis, 323 
Harmonic analysis, 641-5 
Harmonic mean, see Mean, Harmonic 
Head-breadths of students, (Table 4.6, 
Figures 4.1 and 4.2), 78-9 
Height, of men, see Stature 
f*'****® 16-1. Figure 
16.1), 372 ^ 


Histogram, 78-80 

T,, data cited fron, (Talde 442), 

Holzinger, K. J., Factor Analysis, 323 

Homoscedasticity, footnote, 221 

**;J“vestigation into weaQur 

6®7 ■ from.^. 

H^ble, E . data cited from. (Table 15.1), 

™ 371-4 530-2, 

Husbanik and wives, correlation between 
^230*^r* ' ®®“stnnts of. 


Ideal index-number, 601 
Illusory associations, 37-8 
Incomes, see Surtax 

Independence, of attributes, 19-22 ; com- 
plete, 40-41 ; in contingency tables, 51. 
213’ 467-9, 472 ; of variables, 

variable, in curve fitting, 


Index-numbers, generally, 590-609 ; price 
index-numbers, 592-4 ; geometric 
means, 595-9 ; time-reversal test, 599- 
601 ; factor-reversal test, 601 ; "ideal" 
number, 601 ; circular test, 602 ; 
linking methods, 603-4 ; quantum 
indices, 605-6 ; of animal feeding stuffs 
and oats, (Table 9.7, Figure 9.4), 207, 
211, correlation, 223-5 ; of egg-prices, 
625-7 ; of wheat-prices, 645 (Figure 
27.1), 644 

Indices, correlation between, 330 
Infinite populations, in sampling, 367, 380 
Inoculation against cholera, see Cholera 
— , against tuberculosis in cattle, 472 
Intensity, in periodogram, 642 
Interac^ns, in variance analysis, 524 
Interim index of retail prices, 596-8 
Interclass correlation, 273 
Inteipolation and graduation, generally, 
555-589 ; differences, 556-8 ; Newton’s 
formula, 558-561 ; of statistical series, 
561-7 ; effect of errors on differences, 
567-571 ; subdividing an interval, 571 ; 
breaking up a group, 571 -5 ; graduation, 
575-0 ; inverse, 579-582 ; estimation of 
a maximum, 5^-9 
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Interval, subdivision of, 571 
Intraclass correlation, 272-7 ; relation 
with variance-analysis, 512-3. 

Inverse interpolation, 579-582
Isotropy, definition, 57 ; generally, 56-9 ;
of normal distribution, 248
Isserlis, L., on index-numbers, 603 


J-SHAPED frequency-distributions, 87-90 
Jute, sampling of, 531, 545-6 
Juvenile delinquency, 315 


Kelley, T. L., Statistical Tables, 297 
Kelvin, Lord, dictum on measurement, xiii 
Kendall, M. G., on factor analysis, 323 ; 
random sampling numbers, 376 ; rank 
correlation, 455, 456 ; data from, 
(Table 27.2), 647, (Table 27.3), 648; 
peaks in time-series, 657 
Kick of a horse, deaths from, 193 
King, G., graduation of age statistics, 577 
Kurtosis, definition, 164 ; of binomial, 
175^ ; of normal, 183 ; of Poisson, 192 ; 
effect on standard error of standard
deviation, 443-4 


Labourers, agricultural, see Agricultural, 
Earnings 

Lanarkshire milk experiment, 543 
Laplace, P. S., Marquis de, normal distri- 
bution, 169 

Leading term and leading differences, 557 
Least squares, method of, in regression, 
216-7, 282-4 ; in curve fitting, 343-5 
Lee, Alice, data cited from, (Table 4.9), 
86; (Table 9.3), 202 

Lemna minor, correlation in, (Table 9.9) 
226 ; growth in, 348-350 
Leptokurtosis, 164

Levels of significance, in χ² test, 471-2 ;
in t-test, 488 ; in z-test, 510-2
Lexis, W., use of term "dispersion,"
407-8 

Linear constraints, 451 
Linearity of regression, see Regression 
Linking methods in index-numbers, 603 
Little, W., data cited from (Exercise 9.2), 
233 

Lloyd’s Register, data cited from, (Table 
26.2), 613 

Loss in weight of soils, see Percentage 
Losses of ships, (Table 26.2, Figure 26.2), 
613-4, 639 

Lottery sampling, 375 


Macdonell, W. R., data cited from (Table 
4.6), 78 

Mahalanobis, P. C., data cited from, 529, 
531, 546, 549 

Manurial treatments, 515, 524 
Marley, Joan, data cited from, (Table 
26.3), 614 

Marriages, Australian, see Australian 
— , age at, (Table 22.2), 507
Maximum, estimation of position of, 582
Mean, arithmetic, generally, 104-111 ;
calculation of, 105-8 ; properties of,
110-1 ; relation with mode and median,
117 ; of sum or difference, 110-1 ;
reciprocal relation with harmonic mean,
121 ; of binomial, 174 ; of Poisson, 191 ;
weighting of, 332-7 ; standard error of,
428 ; means of two samples, 429-430 ;
t-test for, 487-492 ; estimates of, 482-3
Mean deviation, see Deviation, mean 
— difference, 146-7
— error, 137

— , geometric, 118-120 ; weighting of,
337 ; in index-numbers, 595-6, 598-9
— , harmonic, 120-1 ; relation with
arithmetic mean, 121 ; in sampling
theory (Exercise 17.11), 411

— square contingency, see Contingency

— square error, 137

— , weighted, 332-7 ; in death-rates etc., 
335-6 

Median, generally, 111-116; determina- 
tion of, 112-4 ; comparison with mean, 
114-5; advantages of, 115-6; relation 
with mean and mode, 117; standard 
error of, 421-6 

Mendelian breeding experiments, 29, 121, 
389 

Mental defectives, relation with radio 
licences, (Table 13.2), 315 
Mentality, relation with weight in 
criminals (Table 3.6), 64 
Mercer, W. B., data cited from (Exercise 
4.5 (c) ), 99 

Method of least-squares, see Least squares 
Mice, numbers in litters, 121, (Exercises 
17.12 and 17.13), 411 

Milk-yield in cows, correlation with age
(Table 9.4), 204 ; (Figure 9.9), 219 ;
constants of (Exercise 9.3), 235 ;
correlation ratios (Exercise 11.1), 278
Milton, John, use of word "statist", xvi
Mode, generally, 116-7; relation with 
mean and median, 117; estimation of, 
582-3 

Modifiable unit, 310-3
Modifying central ordinates, 583-4 
Modulus, as measure of dispersion, 137 
Moments, first, definition, 106; 
definition, 127 ; generally, 151-160 ;
about mean in terms of those about any
bivariate distributions, footnote, 222 ;
standard errors of, 437-442 ; correlation
between errors, 441-2

Moore, L. B., data cited from, (Table 4.9),
86 

Mortality, see Death-rates 
Moving averages, 617-624 ; see Trend 
— , weights, in index-numbers, 603
Municipal elections, (Table 17.1), 402
Multiple correlation, see Correlation,
multiple 

Multivariate analysis, 323 


National Income, (Table 26.7), 628
Newbold, Ethel M., partial correlations, 
306 

Newton's formula, in interpolation, 558- 
561 ; binomial coefficient in (Table 
24.4), 564 

Nonsense correlations, 315-7 
Normal dispersion, in Lexis' sense, 407 
Normal distribution, as limit of binomial, 
177-181; properties of, 181-3; constants 
of, 183-4 ; ordinates and areas of, 
1^-6 ; as an error distribution, 186-7 ; 
occurrence of, in nature and theory, 
187-9 ; normality of sampling distri- 
butions, 485-6 

Norton, J. P., data cited from (Table 9.5), 
205 


Oats, correlation of prices with those of 
home-grown feeding stuffs, (Table 9.7) 
207, (Figure 9.4) 211, 223-5; price 
index-numbers of, (Table 25.2) 593 
Ogburn, W. F., data cited from, 322
Ogive, Galton's, 145 ; see Distribution 
Curve 

Order statistics, 260. See Rank Correla- 
tion 

Oscillations in time-series, 614-624 ; 
effect of moving averages on, 629-631 ; 
generally, 637-658 ; serial correlation, 
639-641 ; periodogram analysis, 641-5 ;
autoregressive series, 645-651 ; correlo-
gram, 651-654 ; "periods" of, 656-8
Orthogonal polynomials, 357
Osculatory interpolation, 579


PARABOLAS, fitting of, 341 ; see Curve
fitting 

Parameters, definition, 414 
Partial association, see Association, partial 
— correlation, see Correlation, partial
— rank correlation, 264, 306


Pauperism, correlation of, (Exercise 9.2),
233 ; 290-3, 297

Peak, in time-series, 638 ; mean-distance
in autoregressive series, 656-7 
Pearce, Gertrude E., data cited from 
(Table 4.14), 96 

Pearson, Karl, contingency, 53 ; correction
to coefficient of contingency, 54 ;
definition of β's, footnote, 164 ; bino-
mial apparatus, 176 ; system of
curves, 94-5 ; normal correlation and
contingency, 250 ; data cited from,
(Table 3.4), 58, (Exercise 3.1), 65,
(Table 4.9), 86, 117, (Table 9.3), 202
Pearson curves, 94-5
Peas, experiments in crossing, 389
Pecten, correlation between two diameters 
of shell, (Table 9.1), 200 ; constants of, 
(Exercise 9.3), 235 

Percentage loss of weight in soils, (Table 
15.3), 356 ; curve fitted to, 352-6 
— , standard error of, 387 
Percentiles, see Quantiles 
Period, of time-series, see Oscillations 
Periodogram, 641-5 ; see Wheat-prices 
Pessimism, in crop forecasting, 544-5
Petals, of buttercup, (Exercise 4.5 (d) ), 
100 ; unsuitability of median for, 112 
Phase, in time-series, 638 
Platykurtosis, 164 

Poisson distribution, 189-194 ; constants 
of, 191-2 ; in sampling, 393 
Polynomials, in curve-fitting, 341-4 ; 
orthogonal, 357 ; differences of, 556 ; 
forms of, in interpolation, 566 
Poppies, stigmatic rays of, (Table 4.2), 71 ;
unsuitability of median for, 112
Population, statistical, footnote, 1 
— , estimation of, between censuses, 120; 

curve fitted to, 358 
Positive classes and attributes, 7-9 
Potatoes, yields of, (Table 13.1), 311 ;
515-6 ; (Table 23.1), 545, (Exercise
27.1), 659

Precision, 137 ; of estimates, 369 ; varies 
as square-root of sample number, 394 
Prest, A. R., data cited from, (Table 26.7), 
628 

Pretorius, S. J., data cited from, (Table
4.8), 84, (Table 4.10), 88
Price-level, effect of change in, 627
Price-relatives, 591 

Prices, index-numbers of, 592-606, see 
Index-numbers ; use of geometric mean 
in, 120 

Principal axes, in correlation, (Figure
10.1), 240 ; in curve fitting, 361
Probability, 369 ; 415-7. See Sampling
Probable error, see Error, standard
Pseudo frequency-distribution, 94-5 
Punched cards, recording of information 

on, 62-3 

Purposive sampling, 369, 382-4 
Quality control, use of range in, 125
Quantiles, 144 ; standard error, 421-6
Quantum index-numbers, 605-6
Quartile deviation, see Quartiles
Quartiles, definition, 140 ; deviation, 142 ;
empirical relation with standard devia-
tion, 142-3 ; graphical determination of,
144-6 ; in measuring skewness, 160-1 ;
of normal distribution, 186 ; standard
errors of, 421-5

(Exercise 17.2), 409 
Quota sampling, 542


Random element, in time-series, 614 ;
effect of trend-elimination on, 630-1 ;
645-6

Random sampling, 374-384 ; technique of,
374-6 ; random sampling numbers,
376-9 ; importance of, 381-2
Randomness, tests for, in time-series, 638-
641 

Range, as measure of dispersion, 125 

Rank correlation, 260-270 ; Spearman's ρ ;
Kendall's τ, 262-4 ; tied ranks,
264-6 ; relationship with product-
moment correlation, 269-270 ; partial
correlation, 306 ; standard error of ρ,
454, of τ, 455-6 ; t-test of, 454, 493
Ranunculus bulbosus, see Petals 
Registrar-General, standardisation of
death-rates, 336 ; data cited from
reports of : death-rates of occupied
males, 39 ; blindness and derangement
(Exercises 2.6 and 2.15), 44, 46

» housing 

(Table 3.5) 63 ; birth-rates, (Table 4.1) 
P •* deaths from scarlet fever (Table 
husband and wife, 
201 : birth-rates (Table ; 
9.6), ; general and infantile mortal- 1 

population I 

(T^e 15^), 359 ; voting in municipal 
(Table 17.1), 402 ; expectation 
of life, 581 

Regression, generally, 213-231 ; curves of,
213 ; coefficients of, 221 ; calculation
of, 222-230 ; in normal variation, 241 ;
non-linear, 213, 255-6 ; multiple varia-
281-306 ; partial regressions,
ail;;6 ; in terms of higher-order
300 ; in terms of lower-
order coefficients, 289 ; in wheat-yields
and weather, 320-2 ; economy in
number of variables, 322-3 ; standard
error of, 453 ; significance in small
492-3 ; test of linearity of,

Reiersøl, O., on confluence analysis, 323


discount in American banks. 
(Table 9.5, Figure 9.2), 205, 209 
Residuals, 343, see Least Squares
Rider, P. R., data cited from, 414
Room space, deficiency in, (Table 3.5), 63 


SAMPLING fractions, 533-9
Sampling numbers, see Random sampling
Sampling, generally, 366-534 ; types of
of significai 
70-1 ; rand 
j ^ » technique 

random, 374-6 ; random sampling
numbers, 376-380 ; from infinite popula-
tions, 380 ; from hypothetical popula-
tions, 380-1 ; purposive, 382-4
— , of attributes, 386-412 ; simple, 386-7 ;
mean and s.d. in, 387-390 ; standard
error ; case where parent propor-
tion unknown, 390-4 ; limitations of,
394-6 ; applications,
396-400 ; non-simple, 400-7 ; Lexis'
approach, 407-8

— , of variables, large samples, generally,
415-458 ; sampling distribution, 414-9 ;
simple sampling, 419-420 ; approxi-
mations, 420-1 ; standard error, 421 ;
of quantiles, 421-6 ; of semi-inter-
quartile range, 427-8 ; of arithmetic
mean, 428 ; means of two samples,
429-430 ; non-simple sampling, 430-3 ;
standard errors of moments, 437-442 ;
of variance, 442 ; of standard deviation,
442-6 ; two samples, 446 ; of moments,
447-8 ; of coefficient of variation, 448-
450 ; of β₁ and β₂, 450-1 ; of correla-
tion coefficient, 451-3 ; of regression,
453 ; of correlation ratio and multiple
correlation coefficients, 453 ; of co-
efficient of association, 454 ; of coefficient
of contingency, 454 ; of Spearman's ρ,
454-5 ; of Kendall's τ, 455-6
— , of variables, small samples, 482-502 ;
estimates, 482-4 ; degrees of freedom of,
484-5 ; tests of significance, 485 ;
assumption of normality, 485-7 ; t-
distribution, 487-492 ; significance of
regressions, 492-3 ; Fisher's distribution,
493-5 ; correlation coefficient, 495-9 ;
correlation ratio, 517-9 ; linearity of
regression, 519-520 ; multiple correla-
tion coefficient, 521-2. See also Analy-
sis of Variance

— , practical problems, 530-554 ; size of
unit, 530-2 ; stratified sampling, 533 ;
sampling fractions, 533-9 ; systematic
sampling, 542 ; quota sampling, 542 ;
sequential sampling, 543 ; examination
of samples, 544 ; pessimism, 544-5 ;
duplicated enumeration,
545-7 ; bias, 547-550 ; the vanity effect,
550 ; the sympathy effect, 550-1 ;
methods of minimising distorted res-
ponse, 551-2

Saunders, Miss E. R., data cited from, 29
Scale reading, bias in, (Table 4.4), 74
Scarlet fever, deaths from, (Table 4.11,
Figure 4.11), 89, 90 ; mean, 108 ;
median, 113

Scatter diagram, 211-2 ; generalised, 297 
Scottish Milk Records Association, 453 
Screws, measurements on, (Table 4.3), 72 
Seasonal effects, in time series, 624-7 
Semi-interquartile range, see Quartiles
Semi-invariants, seminvariants, see 
Cumulants 

Sequential sampling, 543 
Serial correlation, 639 
Shakespeare, W., use of word "statist",
xvi

Sheep population, (Table 26.1), 612; 
(Figure 26.1), 613 ; trend line fitted to, 
6^2, (Figure 26.5), 622 ; variate- 
differences of, 632-3 ; residual after 
trend-elimination (Table 27.1), 640; 
serial correlations (Table 27.4), 650; 
correlogram (Figure 27.4), 650 ; auto-
regressive scheme for, 653-4 ; residual
variance, 655

Sheppard, W. F., corrections for grouping,
133-4, 158-9 ; theorem on normal
correlation, (Exercise 10.4), 252 
Shipping-freights, index-number of, 603-4 
Significance levels, see Levels of Signific- 
ance 

Silvey, R. J., on sampling for radio 
audition, 550 

Simple interpolation, 559-561
Simple sampling, see Sampling of Attri-
butes, Sampling of Variables
Sinclair, Sir John, use of words
"statistical", "statistics", xvii
Size of sampling unit, 530-2
Skew frequency-distributions, 83-7
Skewness, 83-7 ; measure of, 162-3 ;
standard error of Pearson's measure of,
450

Small chances, see Poisson distribution
— samples, see Sampling of variables,
small samples

Soil, relationship between temperature
and loss of weight, 352-7
Southey, H., (Table 4.12), 94
Spahlinger vaccine for tuberculosis in
cattle, 472

Spearman, C., theorems on correlation,
327-8 ; "footrule," footnote, 262 ; see
Rank correlation

Spencer's formulae for graduation, 623
Spurious correlation in indices, 330
Standard deviation, see Deviation,
standard


Standard error, see Error, standard 
— error of a particular statistic, see under 
that statistic or under Error, standard 
Standardisation of death-rates, 335-7 
"Statist," occurrence of word in Shakes-
peare and Milton, xvi
"Statistic," definition, footnote, 414
Statistical series, interpolation of, 561-3
Stature, correlation in father and son 
(Table 9.3), 202; (Figure 9.3), 210; 

regression lines (Figure 9.8), 218 ; 
constants of (Exercise 9.3), 235 ; correla-
tion ratios, 258-9 ; test for normality,
243-8 ; test for isotropy, 249-250 ; 
standard error of correlation, 452 
Stature of males in the United Kingdom 
(Table 4.7), 82; (Figure 4.7), 83; 

mean, 106-7; median, 112-3; means 
and medians of constituent countries 
(Exercise 5.1), 122 ; standard deviation, 
131-2 ; mean deviation, 139-140 ; 

quartiles, 141-2 ; s. d. and m. d. of 
constituent countries (Exercise 6.1), 
148 ; third and fourth moments, 

153-5, 159 ; β₁ and β₂, 160 ; skewness,
162 ; kurtosis, 164 ; cumulants, 165 ; 
normal curve fitted to (Figure 8.3), 189 ; 
standard errors of mean, 428 ; of 
median, 425-6 ; of deciles, 426 ; of 
standard deviation, 444 ; of third and 
fourth moments, 447-8 
Stigmatic rays in poppies, see Poppies 
Stirling, James, approximation to 
factorial, 179 

Stratified sampling, 371, 382-4, 533-542
"Student" (W. S. Gosset), mnemonic for
kurtosis, 164 ; standard deviation of
Spearman's ρ, 455 ; on Lanarkshire
milk experiment, 544
"Student's" distribution, see t-distribu-
tion

Sub-division of intervals, in interpolation,
571-5

Subnormal dispersion, in Lexis' sense, 408
Sugar beet, determination of sugar
content, 383-4

Sunspots, oscillations in Wolfer's numbers
(Table 26.4), 615 ; (Figure 26.4), 616 ;
as autoregressive series, 656
Supernormal dispersion, in Lexis' sense,

Sur- and super- tax (Table 4.5), 77 ; 

quantiles (Exercise 6.3), 149 
Sympathy effect, in sampling, 550-1 
Systematic sampling, 542


t-DISTRIBUTION, 487-8 ; applications, to
testing a mean, 489 ; comparison of
two means, 490-2 ; regression
coefficients, 492-3 ; test of Spearman's
ρ, 455 ; test of product-moment
correlations, 499
Tabulation of data, 4, 59-61, 72-7, 201-4
Tangential interpolation, 579
Temperature and loss of weight in soil,
see Percentage

Tests of significance, see Sampling of
variables, small samples
Tetrachoric r, 270-1 ; different from
product-moment, 272 ; standard error
of, 452

Thiele, T. N., second footnote, 164 
Ticket sampling, 375 
Tied ranks, 264-6 

Time-reversal test, in index-numbers,
590-2

Time-series, generally, 610-661 ; examples
of, 611-6 ; trend, 616-7 ; moving
averages, 617-624 ; elimination of
seasonal effects, 624-7 ; effect of trend-
elimination, 627-631 ; variate differenc-
ing, 631-3 ; tests for randomness, 638-9 ;
serial correlation, 639-641 ; periodogram
analysis, 641-5 ; autoregressive series,
645-651 ; correlogram, 651-4 ; proper-
ties of autoregressive series, 654-6 ;
period of an oscillation, 656-8
Tippett, L. H. C., sampling numbers, 376 
Tocher, J. F., data cited from, (Table 9.4), 
204 ; correlation of milk-yield and
butter fat, 452

Trend, 616 ; determination by moving
averages, 617-624 ; effect of elimination
on harmonic component, 629 ; on 
random component, 630 ; variate- 
differences, 631-3 
Trough, in time-series, 638 
Truncated frequency-distribution, 91-2 
Tschuprow, A. A., coefficient of contin-
gency, 54-6

Tuberculosis in cattle, vaccine for, 472 
Turning-point, in time-series, 638 
Type, of array, 199 


Variation, coefficient of, 143-4 ; standard 
error of, 448-450 

Velocity-distance relation in nebulae, 
(Table 15.1), 340; (Figure 15.1), 341, 
347-8 

Volume of exports, index-number of, 606 


Wages, of labourers, see Agricultural
Labourers, Earnings
Wald, A., sequential sampling, 543
Weather and crops, correlation, 320-1
Weight of criminals, (Table 3.6), 64
Weight of males in the United Kingdom,
(Exercise 4.6), 100 ; mean, median and
mode (Exercise 5.3), 122 ; s.d., m.d.,
quartiles (Exercise 6.2), 149 ; moments, 
fig, and skewness (Exercises 7.1 
and 7.2), 167 ; standard error of mean 
(Exercise 18.5) 435 ; of median and 

quartiles (Exercise 18.4) 434 ; of 

standard deviation (Exercise 19.1), 457 
Weldon, W. F. R., see Dice 
Wheat, yields of (Table 13.1), 311, 493, 494 ;
prices (Table 25.1), 591 ; Beveridge
price-index, 645, (Exercise 27.8), 660
— shoots, distribution of (Table 16.1), 
372 

Whitaker, Lucy, data cited from (Exercise 
8.17), 198 

Whiting, Madeleine H., data cited from 
(Table 3.6), 64 

Wholesale prices, index-number of, 598-9 
Willis, J. C., data regarding Chrysomelida, 
(Table 4.13), 95 

Wireless licences, see Mental defectives

Wolfer's sunspot numbers, (Table 26.4),
615

Woo, T. L., data cited from (Exercise 3.10),
67-8


Ultimate classes, 5-6, 7-8
Undertakings, Electricity, see Electricity
Unit, size of, in correlation, 310-3 ; in
sampling, 531-2
U-shaped distributions, 90-1, 93


Value of estates, (Table 4.12), 94 ;
(Figure 4.13), 92
Vanity effect, in sampling, 550
Variables, theory of, generally, 69ff ;
sampling of, see Sampling of variables
Variance, definition, 127 ; standard error
of, 442 ; estimates of, 483 ; Analysis of,
see Analysis
Variate, definition, footnote, 69
Variate-difference method, 317-9 ; in
eliminating moving averages, 631-3


Yates, F., data cited from (Table 16.1),
372 ; on farm survey, 536-8 ; Sampling
methods, footnote, 542
Yields, of grain, (Exercise 4.5 (c)), 99 ;
(Table 22.5), 509 ; of potatoes,
(Exercise 27.1), 659 ; of milk, see milk-
yields
Yule, G. Udny, passim ; data cited from,
cholera, 25-6, 27-8 ; poppies (Table
4.3), 71 ; reading a scale (Table 4.4),
74 ; (Table 4.13), 95 ; duckweed (Table
9.9), 226 ; experiments on 476 ;
judgment of tint (Exercise 20.5), 479 ;
in correlation, 520 ; sunspot numbers
(Table 26.4), 615



